First-syllable *ə in Proto-Mordvinic?

The following is, currently, more of a hypothesis I wish to record than an actual result.

Out of the two Mordvinic languages, Erzya shows the simple vowel inventory /i e a o u/ (plus a recent marginal /ɨ/ phonemicized by Russian loanwords). Moksha adds to this firstly an open front vowel /ä/, but also a reduced vowel /ə/ with front and back allophones. In noninitial syllables this corresponds to vowel-harmonic /e ~ o/ in Erzya, or in some dialects instead /i ~ u/. There are two main reconstructions of the Proto-Mordvinic situation: the Finnish/Hungarian approach, which posits Moksha-like original *ə, and the Russian approach, which posits Erzya-like original *i ~ *u. In terms of phonetic typology, the latter seems simpler from the Mordvinic dialectology viewpoint: *i ~ *u > /ə/ is trivial vowel reduction, while *ə > /i ~ u/ is rather less common, and also runs counter to typical vowel inventory trends in the region. [1] The former, on the other hand, seems simpler from the wider Uralic viewpoint: PMo *ə quite typically continues PU unstressed *a ~ *ä, and routing reflexes like *kota >> /kudo/ ‘house’ thru a stage *kudu with a close vowel appears unparsimonious. I have tended to follow the *ə reconstruction already since I mostly talk about Mordvinic within the Uralic context. A second motivation that appears reasonable to me is the existence of Erzya dialects where PMo *e *ä yield /ä e/ (minimal pair: /käď/ ‘skin’, /keď/ ‘hand’ ~ Mk. /keď/, /käď/ respectively), a “flip-flop” that seemingly demands some feature in addition to height for distinguishing these. We could posit that *e, *o were, at least phonetically, reduced vowels *ĕ, *ŏ, which would then also suggest that *ə was their unstressed neutralized allophone.

But most of this seems to be further complicated by a look at initial-syllable /ə/ in Moksha. This most typically corresponds instead to /i/ and /u/ in Erzya, including in dialects with /e ~ o/ corresponding to Mk. non-initial /ə/; sometimes we even find both close vowels represented in Erzya dialects; relatively often Uralic sources of such vocabulary would predict **e or **o; sometimes we find loss of the vowel altogether, either in just Erzya or also in Moksha dialects. A few examples:

  • Er. /kirta-/, /kurta-/ ~ Mk. /kərta-/ ‘to singe, scorch’ < PU *kor(p)-tta- (predicted PMo **kurtə-);
  • Er. /turva/ ~ Mk. /tərva/ ‘lip’ < PU *turpa (predicted PMo **torva);
  • Er. /troks/, /truks/, /turks/ ~ Mk. /tərks/, /turks/, /truks/ ‘across, thru’ < PU *tora-ksə (predicted PMo **turəks);
  • Er. /srado-/, /strado-/ ~ Mk. /səradə-/ ‘to be strewn’ < PU *sira- (predicted PMo **sora-).

Generally I’ve seen the /i/ ~ /ə/ and /u/ ~ /ə/ correspondences explained thru new secondary vowel reduction in Moksha. But this really fails to explain why we should have any doublets like /kirta-/ ~ /kurta-/ within Erzya as well. Given this and the cases of syncope, my current hypothesis is that perhaps we should be treating Moksha /ə/ as older, already Proto-Mordvinic, and the Erzya full vowels as secondary. This would obviously confirm that unstressed /i ~ u/ in Erzya also has to be secondary compared to Moksha /ə/; but this comes at a cost: it would also seem to mean that we now have some reason to suspect a contrastive Proto-Mordvinic *ə at least in the first syllable. Many, though not all, cases of such an *ə seem to be further followed by a full vowel /a/. Stress retraction onto full vowels is typical in the region, and so instead of setting up a new vowel quality contrast, a stress contrast might be possible: *tərvá = */tOrvá/ for ‘lip’, versus e.g. *tólga (= Er Mk /tolga/) ‘feather’. Non-initial stress placement like this is in fact attested from both Erzya and Moksha. — But then what of cases like ‘across’? Would we also need to set up contrasts like *təróks = */tOróks/, versus *mórə = */mórO/ ‘song’ (> Er /moro/ ~ Mk. /mor/)? Or even, since reflexes like /turks/ also occur (but not ˣ/turoks/, ˣ/təruks/ etc.), do we perhaps need to set up a syllabic *r̥ here??

All of this should be also further compared with words showing syncope in both Erzya and Moksha. If first-syllable *ə was allowed in Proto-Mordvinic, it seems quite possible to me that words like Er. /pŕa/ ~ Mk. /pŕä/ ‘end, head’ < PU *perä (predicted PMo **piŕə) should be reconstructed not just yet with an initial cluster, but rather as something like PMo *pəŕa, and with syncope only incidentally taking place in both languages later on in auspicious positions of this kind, i.e. where syncope would produce a typologically natural initial consonant cluster (the same environment as initial-vowel syncope in Udmurt).

[1] I would propose solving this by routing the /i ~ u/ dialects thru the mainline /e ~ o/ type: after “de-reduction” of *ə to full vowels, these dialects would have gone thru vowel reduction again, but this time not of the centering but rather inventory-reducing type: unstressed *e > /i/, *o > /u/. This is well paralleled by unstressed /e/ × /i/ > [ɪ] in Russian, which of course has been the most significant contact language of Erzya for the last several centuries already.

Posted in Reconstruction

No mid vowel dissimilation in Greek — nor Finnish?

I recently read “Deconstructing ‘height dissimilation’ in Modern Greek” (Journal of Greek Linguistics 3, 2002) by Julián Méndez Dosuna. I don’t really dabble in Modern Greek dialectology, but this struck me as an interesting paper for its methodology regardless, and the lessons seem to apply also more widely.

The story goes: Modern Greek varieties often reflect Ancient Greek /ea eo/ as /ia io/, and while AGk /oa/ was rarer, it can be also reflected as ModGk /ua/. [1] This has traditionally been explained to have come about thru a process of height dissimilation: [mid] + [non-close] > [close] + [non-close]. JMD however argues for a different pathway. Using /ea/ for illustration, the first stage would rather have been coalescence to a diphthong /e̯a/, followed by unconditional raising of the nonclose nonsyllabic to give /ja/ (both reflexes are also attested among the palette of ModGk dialects), and finally re-breaking to /ia/. His main objection is that mid vowel dissimilation seems to be phonetically unmotivated, that explaining it as a means to prevent syllable contraction is too teleological, and that this explanation makes no sense anyway for dissimilation feeding into glide formation (which is the traditional routing of varieties showing /ja/).
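To make the difference between the two routings concrete, here is a minimal sketch that treats each account as an ordered list of string rewrites and prints the intermediate stages for /ea/. The stage labels, the `derive` helper and the use of plain string replacement are my own illustrative assumptions, not anything from JMD's paper; the nonsyllabic vowel is written with the combining diacritic U+032F.

```python
NONSYL = "\u032f"  # combining inverted breve below: marks a nonsyllabic vowel

# Traditional account: height dissimilation first, glide formation afterwards.
TRADITIONAL = [
    ("height dissimilation", "ea", "ia"),
    ("glide formation", "ia", "ja"),
]

# Pathway argued for by JMD: coalescence, raising, re-breaking.
COALESCENCE = [
    ("coalescence", "ea", "e" + NONSYL + "a"),
    ("raising of the nonsyllabic", "e" + NONSYL + "a", "ja"),
    ("re-breaking", "ja", "ia"),
]


def derive(form, stages):
    """Return the input form followed by its shape after each stage."""
    history = [form]
    for _label, old, new in stages:
        form = form.replace(old, new)
        history.append(form)
    return history


if __name__ == "__main__":
    for name, stages in (("traditional", TRADITIONAL), ("coalescence", COALESCENCE)):
        print(name + ":", " > ".join(derive("ea", stages)))
```

In this toy model the traditional routing ends at /ja/ and never produces the diphthong, while the coalescence routing passes thru both attested intermediates on the way to /ia/.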

I am fully on board with this kind of an approach. It is my experience that dialectologists quite often (1) operate on an assumption of deriving modern dialects directly from a classical/standard variety of the language, and (2) do not have a good knowledge of comparative linguistics besides their own subject. Because of this they can end up proposing all kinds of historically backwards and/or phonologically nonsensical reconstructions or sound changes. Two examples from elsewhere would be alleged /q/ > /ɢ/, /g/ in Arabic dialects (surely rather an earlier split with something like (*kʼ >) *k̰ˤ > *q̰ > /q/ in Classical Arabic versus *k̰ˤ > *q̰ > /ɢ/ > /g/ or *k̰ˤ > *k̰ > /g/ dialectally) [2] and alleged conditional /aɪ aʊ/ > [əɪ əʊ] in Canadian English (surely rather Early Modern English *əɪ *əʊ being positionally retained and only conditionally lowered to /aɪ aʊ/).

If this alone wasn’t enough though, JMD covers also plenty of indirect reasons to prefer a glide formation + breaking pathway. From the Greek dialect data we have the following points:

  • While mid + mid /eo/ can develop to /io/, the sequences /ee/ and /oo/ [3] do not develop to **/ie/, **/uo/, and they instead generally show contraction to simple /e/, /o/.
  • Glide formation explains concomitant stress retraction from e.g. /éa/ to /iá/ in some dialects, and also “regular hypercorrection” from e.g. /iá/ to /ía/ in others; or per JMD rather: stress advancement upon the re-breaking of /ja/ to /ia/.
  • Re-breaking explains the history of dialects where e.g. /ia/ (from earlier /ea/ or not) appears to have given /ja/ only after “palatalizable” consonants, into which the glide is then absorbed; i.e. /nia/ > *ɲja > /ɲa/, but /ðia/ remains unchanged. Per JMD, the latter rather gives intermediate *ðja as well, but reverts to bisyllabic after *ɲj > /ɲ/ coalescence has applied.
  • Also in varieties where the bisyllabic realization remains prescribed, sequences starting with a mid vowel can parse as a single syllable in poetry, and phonetic diphthongs such as [e̯a] can be observed in connected speech.

As two additional typological arguments, he notes that mid vowel dissimilation, i.e. raising only before open or non-close vowels, is not well-attested as a synchronic phonological process, and that diphthongs do show a strong cross-linguistic tendency towards fully close endpoints. [4]

I didn’t catch this point being made particularly explicitly, but linking /e̯a/ and /ja/ together diachronically also seems like increased economy over the traditional assumption of two unrelated coalescence processes along the lines of /e̯a/ < /ea/ > /ia/ > /ja/.


This all naturally makes me wonder about Finnish, where mid vowel dissimilation is a classic dialect feature, applying to unstressed /ea eä oa öä/ sequences. These primarily come about following elision of earlier unstressed *-ð- and are primarily found in four morpholexical environments: adjectives in -eA; partitive singulars in -A of nominal stems in -e-, -O-; infinitives in -A of verb stems in -e-, -O-; [5] “contracted” verbs in -A- derived from stems in -e-, -O-. All of these yield /ia iä ua yä/ in a variety of dialects, maybe best known as a feature of South Ostrobothnian, but also attested further north; in a small area in the southwest; and a slightly wider area in the southeast. [6]

Let’s first take a moment to consider if in Finnish, too, the /ia/ type reflexes could have actually followed the same /e.a/ > /e̯a/ > /ja/ > /i.a/ trajectory that JMD argues for modern Greek. Just as in Greek, the intermediates could be partly attested: /ja/ is known from a few southwestern and SOstrobothnian varieties, and some eastern varieties show /ea̯/, trivially close to more hypothetical *e̯a. (These are generally dialects that also show /oa̯/ and /eä̯/ for earlier unstressed *aa and *ää, and in principle one could propose that /eä oa/ actually first assimilate to *ää *aa; but for /ea̯ öä̯/ an explanation like this isn’t possible.) It is also the case that /jV/ > /iV/ under some particular conditions is a widely-distributed sound change in Finnish, e.g. /vjV/ > /viV/ in kavia ~ kavio ‘hoof’ < kavja ~ kavjo < ⁽*⁾kapja. I already think this might apply also in more cases than has usually been realized, and perhaps we could go further still and even assume developments such as korkea > korkja > korkia ‘tall’. Three-consonant clusters like /rkj/ would be rather strange to most Finnish dialects however.

There are also some adjective doublets that could be taken to suggest /eA/ > /iA/ > /jA/. Directly attested are at least eheä ~ ehjä ‘whole’, norea ~ norja ‘pliable’, sorea ~ sorja ‘beautiful’. Similar alternation can be reconstructed also behind at least lakea ~ laaja (< *laɣja < *lakja) ‘wide’ and välkeä (← *väleä by suffix exchange) ~ väljä ‘loose’. I am far from certain though about explaining these as phonological doublets. The variants in /-jA/ can be found also in dialects where the soundlawful development is /eA/ > /ee/, e.g. ehjä is found all across Tavastian dialects, and penetrates fairly well into Savonian dialects as well. In at least two cases this alternation even appears just within Karelian, where there is no sign of *eA > ˣ/iA/: kahei (Livvi) ~ kahja ‘coarse, rough’, karie ~ karja ‘coarse, big’. The latter indeed seems to be a specialization of Proto-Finnic *karja ‘cattle; multitude’, i.e. not a secondary development from a **kareda > **karea. My working hypothesis remains that this is mostly a kind of phonetically motivated morphological analogy, and that the forms in /-CjA/ are generally more original.

A final problem is that unlike Greek, Finnish has also original /-CjV/, as in *karja above. Some development to /-CiV/ can be found, but not in all cases. E.g. in SOstrobothnian /-ljV/, /-rjV/ > /-liV/, /-riV/ is regular, but /-hjV/ rather receives an echo vowel, e.g. pohja > pohoja ‘bottom; north’, tyhjä > tyhyjä ‘empty’, clearly distinct from e.g. kauhea > kauhia ‘terrible’.

So a coalescence + re-breaking hypothesis runs into a variety of trouble in Finnish. I still would not want to just abandon the argument about vowel height dissimilation being an unnatural sound change though. Another way to fix the situation is possible too: glide epenthesis, followed by raising conditioned by this new glide (a mechanism that JMD passingly reports from Dutch). Thus, I would propose /ea eä oa öä/ > (? [ee̯a ee̯ä oo̯a öö̯ä] >) /eja ejä owa öɥä/ > /ija ijä uwa yɥä/ > /ia iä ua yä/. This has the same benefits of better typological plausibility, and no major problems with intermediate stages. The intermediate /eja ova/ type is again attested, conveniently neighboring both the SOstrobothnian and the southeastern /ia/ areas. Better still, there’s even the benefit that all three changes can be independently attested in Finnish!

  • Glide epenthesis is a widely spread strategy of hiatus resolution in Finnish dialects, clearly especially in stressed syllables (where typically no further general changes apply); possibly also in unstressed syllables, i.e. in cases of the type *kataɣa >> kataja, SW katava ‘juniper’. These, too, might be at least partly epenthetic glides appended to *kata.a, rather than direct reflexes of *ɣ. (However, *-aða > *-a.a > /-aa/ appears to be exceptionless.)
  • Raising of unstressed /e/ to /i/ before /j/ is well-attested all across Finnish, e.g. in actor nouns from e-stem verbs (sure- ‘to mourn’ → surija ‘mourner’). (No similar change applies with a labial glide, though: sanova ‘saying’ never gives ˣsanuva. Some eastern dialects show instead labial coloring, e.g. tuleva ‘coming’ > tulova; perhaps a more natural effect of the labiodental glide [ʋ].)
  • Even today Finnish really shows no distinction between unstressed [ijV] and [i.V]: contrasts such as nauttia ‘to enjoy’ vs. nauttija ‘enjoyer’ are purely orthographic. Subphonemic variation between [u.V], [y.V] and [uwV], [yɥV] also appears, particularly conspicuous after stressed syllables (e.g. standard tauot ‘pauses’ is usually [tauwot ~ tawːot], not [tau.ot]).
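The three steps listed above can be chained into a small derivation sketch. This is only an illustration under loud assumptions: the rule formulations as regular expressions, the `derive` helper and the choice of example words are mine, and the rules are applied to the whole word, so the input is assumed to contain the vowel sequence only in unstressed position.

```python
import re

# Stage 1: glide epenthesis into the hiatus; the glide agrees with the first vowel.
EPENTHESIS = [(r"e(?=[aä])", "ej"), (r"o(?=a)", "ow"), (r"ö(?=ä)", "öɥ")]
# Stage 2: raising of the mid vowel, conditioned by the new glide.
RAISING = [(r"ej(?=[aä])", "ij"), (r"ow(?=a)", "uw"), (r"öɥ(?=ä)", "yɥ")]
# Stage 3: after a close vowel the glide is not distinctive, so it is dropped here.
GLIDE_LOSS = [(r"ij(?=[aä])", "i"), (r"uw(?=a)", "u"), (r"yɥ(?=ä)", "y")]

STAGES = [EPENTHESIS, RAISING, GLIDE_LOSS]


def derive(word):
    """Return the word followed by its shape after each of the three stages."""
    history = [word]
    for rules in STAGES:
        for pattern, replacement in rules:
            word = re.sub(pattern, replacement, word)
        history.append(word)
    return history


if __name__ == "__main__":
    # Hypothetical test items: one word per vowel sequence.
    for word in ("korkea", "kauhea", "sanoa", "näköä"):
        print(" > ".join(derive(word)))
```

Note that stage 3 would also turn an original form like surija into suria, which in this sketch simply reflects the point that unstressed [ijV] and [i.V] do not contrast.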

This approach would also seem to allow explaining an interesting asymmetry in the small southwestern zone in Uusimaa, which shows only /ea/ >> /ia/ but no /oa/ >> /ua/ (rather /OA/ > /OO/). Here I would note that Finnish definitely has a phoneme /j/ anyway, but no /w ɥ/; maybe this resulted in /eA/ > *ejA but no epenthesis from /oa öä/ to **owa **öɥä. A similar situation extends also to the southwestern dialects proper, which mostly show /ea/ >> /i/ but /oa/ >> /o/. The western Uusimaa dialects are already known for sharing also other features with SW Finnish, and to me it would seem the best to treat the former as an archaic sister group of the latter, not as an SW-influenced group of the Tavastian dialects (which do not form a single historical subgroup anyway). It seems that either *ia *oa or *ia *oo could be reconstructed as the typical pre-apocope reflexes in SW Finnish.

Altogether one very broad point this case study shows is that while the phonological makeup of the Finnish dialects has been well-documented by now, the actual history leading up to them remains open to analysis.

[1] AGk /oe/, when not simply retained, gives however rather ModGk /oi/, or more exactly, the diphthong /oi̯/ = /oj/.
[2] An intermediate voiced stage for “*q” also explains why Proto-Arabic *g is spontaneously fronted to something like /ɟ/ or /dʒ/ in most varieties.
[3] I.e. bisyllabic [e.e], [o.o]; not to be confused with the AGk long vowels η ω /eː oː/ which I believe give short /i o/ in ModGk universally.
[4] I could quibble a bit with this last argument though. Certainly closing diphthongs such as /ai/, /au/ are ubiquitous, but it is not clear to me if close-to-open diphthongs like /i͡a/, /u͡a/ are actually substantially more common than mid-to-open diphthongs like /e͡a/, /o͡a/. But also variation between the two is common, and in all cases known to me, mid-to-open is moreover more archaic than close-to-open (thus e.g. Eastern Finnic, Western Mansi, Northern Samoyedic, several Samic varieties). This diachronic universal will be at least as good for the purposes of his argument, if not better, than JMD’s alleged synchronic universal.
[5] Verb stems in -e- are for some reason not covered by Kettunen’s dialectal atlas, perhaps since quite a few of them have instead consonant-stem infinitives, showing either assimilation of earlier *ð (pure- : purra ‘to bite’, tule- : tulla ‘to come’, mene- : mennä ‘to go’), late retention of *ð (näke- : nähdä ‘to see’), or blocking of lenition from *t to *ð to begin with (pese- : pestä ‘to wash’).
[6] The majority development, including modern colloquial Finnish and also most other Finnic varieties where deletion of medial *ð applies, is to instead contract these to long mid vowels /ee OO/, possibly followed by other changes such as diphthongization to /ie UO/ (thus e.g. Karelian proper) or shortening to /e o/ (thus e.g. Estonian).

Posted in Reconstruction

Followup anti-etymology: ? *täCə ‘birch bark covering’

In the last post I parenthetically mentioned a PU root “*täsə (UEW: *tisɜ)” ‘birch bark covering for a teepee’. This has been previously reconstructed from very scanty evidence: Komi /tis(k)a/, Forest Nenets /tʲēt/ ([tɕi͡et]), Kamassian [tʰɤʔ]. The latter two point to a Proto-Samoyedic form *t¹ät¹, which per the Komi comparison would have to be equal to plain *tät (*t¹ stands for *t or *č, which cannot be distinguished without Selkup). Samoyedic sometimes seems to have irregular *ä for PU *i (e.g. *mäńä ‘daughter-in-law’), but I think this word does not need to be one of them: this can be also the inverse, with Komi /i/ secondarily from *ä, a development attested also in e.g. /ki/ ‘hand’ < *kätə.

UEW makes the same mistake, I think, in one other case too: Permic *li ‘sap, phloem’ ~ Kamassian [lēji] ‘sap’ has been reconstructed as PU *lijɜ, where *läjə or *läŋə would seem better to me (but unexpected retention of *l- in Samoyedic and the unexplained (suffixal?) final vowel leave me suspicious about whether this comparison, too, is correct at all).

I realize today that the consonantism of my alleged *täsə requires more thought, however. This reconstruction as such should give voiced **-z- in Komi, not voiceless /s/! The variant with /sk/ is however a good hint that the word probably comes about thru some degree of suffixation. I can think of at least three options, none of them entirely unproblematic however:

  1. a PU root *tätə, continued directly in Samoyedic but suffixed to *tätə-ksə > *ti-s(k)-a in Permic, with regular loss of medial *-t-;
  2. a PU root *täsə, continued directly in Samoyedic but suffixed to *täs-kä in Permic;
    • but from early *ä-ä I would rather expect **ɤ or **e in Komi;
  3. a PU form *täkə-ksə / *täxə-ksə / *täwə-ksə, with the 2nd syllable regularly lost in both branches and the nominal suffix *-ksə reduced to *-t in Proto-Samoyedic;
    • but I would expect *-tə, as also found e.g. in *suksə > *tutə ‘ski’ or Jussi Ylikoski’s recent comparison of northern Samoyedic predestinative *-tə with the Finnic translative *-ksi.

If any further cognates were found elsewhere in Uralic, they should be able to help clarify the situation. Quick checkups of Mordwinisches Wörterbuch and Yhteissaamelainen sanasto and mentally going over the Finnish lexicon have all come up negative, at least. Common Ugric “*täŋɜ-tɜ” ‘quiver’ has some vague resemblance (birch bark is a reasonable material for quivers) but probably not enough. It’s also one of the cases with Ob-Ugric *ɣ ~ Hungarian g, which I think is a point against native Uralic origin, ditto **-tɜ which is not a known nominal suffix in Uralic.

Looking outside of Uralic will be a worthwhile check too. I am firstly reminded of Indo-European *(s)teg- ‘cover, roof’ (> German Dach, Greek (σ)τέγος, etc.), which would be a fair match for my third reconstruction as *tä{k|x}ə(-ksə). Routing a loanword into Samoyedic would require a reflex in Indo-Iranian or Tocharian though, and going by standard references neither of them seems to have any kind of a basic noun reflex of this root. The Uralic support is also much too shaky for me to consider any kind of ancient Indo-Uralic cognate status, in case this doesn’t go without saying. So no progress here either.

A better lead seems to be found towards the east. A quick lookover of Turkic has proven similarly unproductive; but in Tungusic we finally find *tüksa ‘birch bark covering for a house’, an exact semantic match with fairly close-by shape. The Komi word could be actually interpreted as a relatively recent loanword from the Evenki reflex /tiksa/. The sound substitution to /sk/ would be curious, as if recapitulating the Proto-Permic metathesis of inherited *ks, but this is really not any worse of a problem than the issues in the comparison with Samoyedic. Morphologically then this comparison indeed looks better! While Komi /-a/ is a known derivational suffix, it productively forms only adjectives. Bisyllabic nouns ending in /a/ are often instead loans, e.g. /ćarla/ ‘sickle’, /koba/ ‘spinning wheel’, from Turkic; /kaľja/ ‘type of beer’, /ľuśka/ ‘spoon’, from Finnic. — Komi and Evenki are not known as close neighbors, but both have been notable trade languages in western Siberia before the expansion of Russian, and a few other Tungusic loanwords in Komi have been already proposed as well.

It still would be good to have additional evidence for *ks → /s/ or /sk/ in loanwords into Komi however. The cluster /ks/ is not categorically shunned, and it can be found e.g. in /ɤksɨ/ ‘prince’ (probably ← Alanic, cf. Ossetic /ɐχsin/ ‘lady, princess’, though some details of transmission remain unclear).

I have also not managed to scrounge up any other etymology for the Samoyedic words. Regardless, going by the Komi ← Evenki loan hypothesis, I now lean towards not reconstructing this word for Proto-Uralic after all.

Posted in Etymology

Probably not a valid etymology: *čäččä ‘birch bark’

The Proto-Finnic word for ‘birch bark’ was *toohi (consonant stem: *toohë-, partitive *tooh-ta), continued directly in Finnish and Karelian tuohi, Veps toh’. The southern Finnic languages mainly show derivatives: Votic toho, standard Estonian toht(u-), Võro tohk(o-), Livonian tū’oigõz (however EES reports a seemingly underived form tooh from somewhere else in South Estonian).

The usual etymology, known for close to 150 years by now, has been to connect this with Latvian tāss, Lithuanian tošis of the same meaning. We could indeed derive *toohi from earlier *taaši or *taašə (your call on the age of the shift of final *-ə to *-i), which will be immediately comparable with an East Baltic *tāšis. Already given the abundance of Baltic loanwords in Proto-Finnic, versus the rarity of Finnic loans reaching Lithuanian, generally the assumption has been that this word, too, comes originally from the Baltic side.

There does not seem to be an immediate Indo-European or even Balto-Slavic etymology though. Basic etymological references suggest derivation from √teš- < PIE *tetḱ- ‘to cut’, but this at first looks like only a semantically vague root etymology. There may well be evidence to further support it in Baltistic literature… but how far would exploring an origin on the Uralic side go?

Looking only at Finnic, *toohi actually has a native enough look, paralleling nouns like *sooli : *soolë- *sool-ta ‘gut’ (from pre-PF *śaali < PU *śalə). In a wider Uralic context just one feature is unexpected: the long vowel preceding *h < *š. As per current understanding, the long nonclose vowels *oo, *ee in Finnic first arise by what I call Lehtinen’s Law: the lengthening of *a, *ä before sonorants in *ə-stems, while before obstruents they remain short. Retention is clearly supported before *k (*käki : *käke- ‘cuckoo’, *mäki : *mäke- ‘hill’, *näke- ‘to see’, *väki : *väke- ‘power’) and *s (*asë- ‘to be located’, *kasi : *kasë- ‘dew’). For *p, *t, *h there is only one example each (*käci : *käte- ‘hand’, *lähe- ‘near, close’, *läpi : *läpe- ‘hole, puncture’), but still no clear counterexamples. I’ve proposed that the stem type *CAATi > *CEETi in Finnic (likewise *CAACA > *CEECA) originates precisely thru IE loanwords, including *toohi < *taaši.
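The conditioning of Lehtinen's Law as described above is simple enough to state as a function. The following is a minimal sketch under my own assumptions: the sonorant inventory `SONORANTS`, the restriction to stems of the shape (C)VCə, and the function name are all illustrative, and the later raising of *aa, *ää to *oo, *ee is not modelled.

```python
import re
import unicodedata

# Assumed sonorant inventory for this sketch; the real set may differ.
SONORANTS = "mnŋlrjw"

# (C)VCə stems only: an optional onset, a short open vowel, a single sonorant, *ə.
_LL = re.compile(
    rf"^(?P<onset>[^aäeioöuüyə]*)(?P<vowel>[aä])(?P<cons>[{SONORANTS}])ə$"
)


def lehtinens_law(stem):
    """Lengthen *a, *ä before a single sonorant in an *ə-stem; else no change."""
    stem = unicodedata.normalize("NFC", stem)
    match = _LL.match(stem)
    if match is None:
        return stem
    return match["onset"] + match["vowel"] * 2 + match["cons"] + "ə"


if __name__ == "__main__":
    # *śalə has a sonorant and an *ə-stem; the others fail one of the conditions.
    for stem in ("śalə", "käkə", "kasə", "kala"):
        print("*" + stem, ">", "*" + lehtinens_law(stem))
```

Under these rules a form like *tašə stays short, which is exactly why *taaši has to get its long vowel from somewhere else.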

And there remains a little bit of room for dout. Interestingly the stem *lähe- does not go back to older *läšə-: instead it appears to be a case of *s > *h, per the evidence of forms like Fi. läsnä ‘near, present’, an archaic locative in *-nA that also finds an exact cognate in Mari *lĭSnə > Hill /lišnə/, Meadow /ləšne/ ‘near’. [1] So in principle there is an opening: it could be proposed that, for whatever reason, pre-Finnic *š also triggers LL.

Actual candidates for Uralic cognates would be still needed to get anywhere further with this speculation. Some pre-Neogrammarian sources (Castrén, Donner) have compared tuohi with forms for ‘birch bark’ like Udmurt /tuj/ (< Proto-Permic *toj), Tundra Nenets /ta͡e/ (< Proto-Samoyedic *təj), but a sound correspondence *h ~ *j is by modern standards clearly untenable and the vowels don’t play nice either. [2]

Something slightly better can be however found in Mansi, where ‘birch bark’ is *šääšə (South /šääš/, Central /šöäš/ ~ /söäs/, North /saas/). This has been compared with Khanty *siińć (Eastern) ~ *seeńć (Western) ‘id.’, but this probably cannot be correct due to the medial consonant mismatch. By comparison with Finnic though we could trace the Mansi word back to a PU form *čäččä instead. Everything other than the Finnic long vowel would regularly follow known sound laws: *ä-ä > *ää-ə, degemination and *č > *š in Mansi; *ä-ä > *a-ə and *čč > *h in (pre-)Finnic. For initial *č-, traditionally the Finnic reflex has been assumed to be *č- > *š- > *h-, but the evidence is not strong, [3] and *č- > *t-, parallel to clearly regular *-č- > *-t-, has been also proposed in recent times. So have we now managed to uncover the Proto-Uralic term for ‘birch bark’?


While this new etymology could formally reach at least a level of “nonprovably regular”, I still think it is not likely to be correct. There are at least four red flags…

– The first is of course the fact that we have also the option of a competing etymology from Indo-European on the table, even if this lacks the benefit of being semantically exact all the way down.

– The second I’ve already pointed at too: the hypothesis that *-Ašə > *-AAšə in pre-Finnic is not a good fit for the known historical phonological framework of Finnic. I do not expect any additional supporting future evidence to be findable either, as the only other Finnic stem of the shape *CEEhi is *voohi < *aaši ‘goat’, an obvious Baltic loanword, this time with good IE provenance (Lt ožys, Lv āzis < *āžis << PIE *h₂aǵ-). I still think it is legitimate to sometimes propose “nonprovably regular” sound changes, but this is on the condition that they should make sense phonologically.

– Third, there would be issues with relative chronology. While PU *čč ultimately gives *h in Finnic, it is not clear if there ever was a stage with *š that even could feed into Lehtinen’s Law. Kallio proposes for this instead a route *čč > *tš > *šš > *hh > *h that makes rather more sense to me. Finnic tolerates geminate *ss just fine, even if its etymological sources are scarce (in Proto-Finnic basically limited to the inessive ending *-ssA < *-snA and some onomatopoeia), and earlier on there probably would not have been any problem with a transient *šš either; while geminate *hh would be typologically a bit more unusual. (Anything like starting from *čäšä and proposing an ad hoc assimilation to *čäčä in pre-Mansi would not be progress either.)

– Last, a lexicological point. While Proto-Uralians obviously must have known the material, the big picture is that ‘birch bark’ is etymologically highly unstable across Uralic: basically every branch has its own term with no clear, unambiguous cognates. Picking any one as the actual primary PU term would be guesswork, and it is not impossible that there even was no single PU word for ‘birch bark’ at all, only some compound or analytical expression along the lines of *kojwan_karə ‘bark of birch’. For that matter, other more or less specialized ‘bark’ terms have wide variety too. This should not be a huge surprize, as terms for the natural environment are typical substrate vocabulary. The Baltic etymology for the Finnic word fits well enough in this pattern (it has indeed been remarked long since that terms for the natural environment are common among Baltic loans in Finnic). We can also at least hypothesize that similarly the discrepancy between Mansi *šääšə ~ Khanty *sii/eeńć could be due to the words coming from two related but different substrate languages of western Siberia; say, pre-Mansi #sɛčV ~ pre-Khanty #senčV. If everything else was in order, a binary comparison could be acceptable, but a comparison drawn from a large pool of candidates that still remains messy is evidence for the similarity of *taaši and *čääčə being accidental.

(Amusingly enough, several words for finished birch bark products have better odds of being reconstructible for PU; e.g. *d₂äŋäs ‘small box made of birch bark’, *küčä ‘drinking vessel made of birch bark’, *täsə (UEW: *tisɜ) ‘birch bark covering for a teepee’.)

So altogether my novel Finnic–Mansi comparison ends up providing more heat than light; it is not a hill I want to die on or even really risk getting injured on. Hopefully still worth putting out there as a humble blog post though. It might be a good illustration of the repeating dead ends and close calls that come up in daily etymological research, but which will be generally left invisible in published works. (And who knows! while I won’t be holding my breath, it’s always a possibility that someone will eventually discover some other way still to bridge the problems in the idea.)

[1] This is though the only example that would retain a trace of the old consonant gradation pattern *s : *h root-medially. Elsewhere we find this alternation only in suffixal gradation (e.g. nominals of the type *taivas : *taivahë- ‘heaven’, *kapris : *kaprihë- ‘deer’) while root-medial *s does not show any productive qualitative gradation (*pesä : *pesä-n ‘nest’, *asë-tta- ‘to place’, and not **pehän, **ahëtta-). This could rouse suspicion for exploring other directions of explanation, such that perhaps läsnä is actually in origin something like a haplological inessive *läšə-snä > *lä-snä, ditto Fi. dialectal lästä ‘(from) near’ a haplological elative *läšə-stä > *lä-stä rather than an archaic ablative *läs-tä. Analogy with *tä-snä, *tä-stä ‘(from) here’ could be an option too. But this too would be an explanation relying on idiosyncratic, essentially ad hoc analogies that cannot be decisive.
[2] I’m not even convinced that the Permic and Samoyedic words should be thought of as cognate with each other. UEW proposes *tojɜ, but neither *o > *o in Permic, *o > *ə in Smy nor retention of *-j would be regular.
[3] More exactly, the evidence for actually reconstructing *č- and not *š- to begin with in the proposed examples is not strong. E.g. there do not seem to be any examples where Finnic *h- would correspond to a clear affricate *c- in Samic, unless we start counting otherwise poor comparisons like PF *hanki (< *čaŋkə?) ‘snow cover’ ~ PS *cōŋoi (< *čaŋoi?) ‘id.’ — Even the general development of initial *č- in Uralic seems to me like it still requires further study; for example Mari too shows some evidence for deaffrication *č- > *š- vs. also some for retention as *č-.

Posted in Etymology

Phonology squib: Conditional *h-loss in Estonian

The history of Proto-Finnic *h provides several illustrative examples of the diachronic development of “laryngeal” consonants. The primary overarching pattern is a north(east)–south(west) cline of gradual loss. This demonstrates that *h-loss processes have arisen independently in multiple lineages, and in multiple layers in many of them:

  • Karelian, Ludian–Veps: generally retained in all positions
  • Northernmost dialects of Finnish: retained but with several metathesis rules
  • South Ostrobothnian Finnish: retained in most positions, but word-final *-eh has been analogically generalized to -es
  • Ingrian, most remaining dialects of Finnish, South Estonian: generally retained in the initial (stressed) syllable and following it, lost after unstressed syllables; additionally word-final retention in some SE
  • North Estonian: retained in the initial syllable and in *CVhV, lost after unstressed syllables and in *CVRhV
  • Votic: retained in *CVhV and *CVhCV, otherwise lost
  • Livonian: *h > [ʔ] (“broken tone”, “stød”) in *CVhV and *CVhCV, otherwise lost

Further detail exists still. One such case is standard North Estonian, where we find word-initial loss in several words. The traditional explanation attributes this to dialect borrowing. There are indeed North Estonian dialects showing complete loss of word-initial *h, so there’s nothing impossible about this. Dialect borrowing would be moreover partially paralleled by the example of word-initial /h/ in Votic, originating in Finnish and Ingrian loans (obvious also by other markers in some cases).

It however seems to me that in Estonian a clear sociolinguistic motivation for dozens of *h-less loanwords from folk language into the literary prestige standard is lacking. We can contrast this with the early development of the Finnish literary standard; despite having Turku as its initial seat of development, standard Finnish has generally shunned specifically Southwestern dialectalisms. Instead the effect has been to “dilute” the dialect of Turku towards standard Finnish and away from the rural SW dialects. [1] Also pronunciation respelling seems unlikely to be the main mechanism; people usually tolerate entirely silent letters quite well (there does not seem to be major pressure to respell kn- in English, h- in French or Spanish, lj- in Swedish, etc.)

I see instead evidence for a further conditional sound change: *h- is lost preceding another *-h- + a voiced segment, i.e. in *hVhRV, *hVhV.

The *hVhRV case has a good half a dozen examples and no counterexamples that I can find:

  • ahel ‘chain’ < *ahl < *hahla (cf. Fi. haahla; ← Germanic)
  • ehmes ‘fluff, down’ < *hehmes (← Baltic *šeusm-)
  • ihn ~ hihn ‘strap’ < *hihna (cf. Fi. hihna; ← Baltic)
  • ihne ‘stingy’ < *hihneh (← Baltic *šikšn-)
  • uhmer ‘mortar’ < *huhmari (cf. Fi. huhmar)
  • ühm ~ hühm ‘slush’ < *hühmä (cf. Fi. hyhmä)
  • õhv ‘heifer’ < *hëhvo (cf. Fi. hieho)

For the *hVhV case there is only one really obvious case + another that shows secondarily inserted medial *h:

  • iha ‘sleeve’ < *hiha (cf. Fi. hiha)
  • ihuma ‘to whet’ < *hiho- < *hi.o- < *hijo- (cf. Fi. hioa, dial. hijoa)

but I suspect that also some cases of *h-loss from *hVhTC belong here, which may have lost their *h in the weak grade:

  • uht : gen. uha ‘swidden’ < ? *huht : *uha < *hukta (cf. Fi. huhta)
  • uhtma : 1PS uhan ‘to rinse’ < ? *huhtma : *uhan < *huhta- (cf. Fi. huuhtoa)
  • õhk : gen. õhu ‘air’ < ? *hõhk : *õhu (cf. Fi. hehkua ‘to radiate, emanate’; hohkua ‘id.’)
  • õhkama : II inf. õhata ‘to sigh, emanate’ < ? *hõhka- : *õha- (cf. the previous)

A general loss of *h also in *hVhTV cannot be the case, per hahk ‘gray; eider’. Potentially the vowel difference could matter, but I would not assume this by just one example.
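The proposed condition can be sketched as a single rewrite rule. This is only an illustration: the function name and the segment classes (`VOWELS`, `VOICED`) are my own assumptions, the rule is applied to the reconstructed forms quoted above without modeling any later changes such as apocope, and it deliberately leaves *hVhT- forms alone, in line with the suggestion that those lost their *h only in the weak grade.

```python
import re

# Assumed segment classes for this sketch (not taken from any source).
VOWELS = "aeiouäöüõë"
VOICED = VOWELS + "mnlrvj"

# Word-initial h, then one or more vowels, then another h,
# then a voiced segment: the first h is dropped.
H_DISSIMILATION = re.compile(rf"^h([{VOWELS}]+h[{VOICED}])")

def lose_initial_h(form: str) -> str:
    """Apply the conditional *h-loss to one asterisk-less form."""
    return H_DISSIMILATION.sub(r"\1", form)

# Forms quoted in the lists above, plus two *hVhT- forms for contrast.
for form in ["hahla", "hehmes", "hihna", "huhmari", "hiha", "huhta", "hahk"]:
    print(form, "->", lose_initial_h(form))
```

The first five forms lose their initial h, while huhta and hahk are left untouched because the second h is followed by a voiceless stop.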

Phonetically, dissimilation of h…h would be natural. But why should the identity of the following segment matter? I think that allophony of /h/ is involved: at least in Finnish there is variation between voiceless [h] (word-initially and before voiceless consonants) ~ at least partly voiced [ɦ] (between voiced segments). If this is or has been the case in Estonian, too, then we could assume that *[hVɦ] first assimilates to *[ɦVɦ], followed by *[ɦ] > ∅ word-initially.

The Central Finnic (North Estonian & Votic) innovation *Rh > *R, where *R ∈ {n, l, r}, could be also naturally routed thru an *[Rɦ] stage. This is not strictly necessary though, since there is no contrasting **th or the like.

— There is slight evidence also for another even more minor *h-loss sound law: juus ‘hair’ < *hibus and juuk ‘fine’ < *hiukka seem to involve *hiu- > *hjuu- or *hʲuu- > juu-.

I do not know offhand if any traditional Estonian folk dialects would follow exactly the *h-loss patterns I’ve identified here. Still, even just acrolectal standard (North) Estonian probably could have gone thru some sound changes all of its own, early on in its development.

[1] Partly this is also because major cities will draw in population from wider out than just from their immediate environments, as demonstrated in the Finnish case by the so-called “Tavastian wedge” dialects that have arisen along the old Turku–Hämeenlinna road, in the parishes of Kaarina, Lieto, Marttila, Kaski etc. in the central parts of Finland Proper.

Posted in Reconstruction

Nonregularity in North Caucasian

Due to a recent ZBB discussion I ended up re-reading Sergei Starostin’s A North Caucasian Etymological Dictionary Preface. This is one of the more worrisome cases of “Moscow School” phonological tarpits: there is no doubt about Northeast Caucasian being a valid family, and I would also think the relationship with Northwest Caucasian is sufficiently established… but the reconstruction the late Starostin advances for the family sure looks like it has too many bells and whistles, with features like six laryngeals that end up almost randomly reshuffled in the descendants, nearly all obstruents having a plain/geminate distinction orthogonal to phonation, or abundant *Cw clusters at all POAs other than labial. I count 132 basic sound correspondences plus some fifty-odd cluster correspondences. Even spread across two root consonant positions in 2300+ reconstructions, in a reconstruction scheme of this kind there are bound to be reflexes that aren’t actually well enough established.

Probably most fixes to this reconstruction would also have to be etymological. Likely there are correspondences representing areal loanwords rather than original inheritance, or correspondences used to stitch together unrelated vocabulary. Just checking for not-really-regular correspondences would be a good start though.

I’ve picked for a quick case study *pC clusters. These appear word-initially, supposedly evolving from certain *Cw clusters, in two far ends of the family: Nakh and Khinalug. The asserted sources are as follows:

  • *ff > N *pχ, Kh. /px/
  • *ćw > N *ps (Kh. /cʼ/)
  • *św > N *ps (Kh. /s(w)/)
  • *śśw > N *ps, Kh. /pš/
  • *cw > Kh. /ps/ (N *c)
  • *čw > Kh. /pš/ (N *č)
  • *xxw > N *pχ
  • *qw > N *pħ (Kh. /q/)
  • *qqʼw, ɢɢw > N *pħ (Kh. /qʼ/)
  • *χχw > N *pħ, Kh. /pχ/

Also a cluster *bʡ in Nakh has three origins asserted: *qʼw, *ʡw and *hw.

How many of these developments are actually regular once we look into it? Put in your bets now…

(1) Nakh *ps is found in five examples. Every single one of them has a different reconstruction! i.e. none of them can be considered regular. Besides the three expected cases of *ćw, *św, *śśw, there’s one of *cc’w (alleged regular Nakh reflex *t-) and one of *ćʼ with no labialization even (alleged regular Nakh reflex *cʼ). Tsk tsk tsk. For that matter, two cases have NWC cognates with a presyllable *pə-, supposedly a prefix. My bet would be that this is what really occurs in the Nakh examples too.

(2) A Nakh *pš turns out to exist in one example with *čʼw, whose regular Nakh reflex is allegedly plain *š-. (Maybe another likely prefix case?)

(3) Nakh *pχ is found in four examples; just one of *ff, so irregular in any case. There are no more than two initial and four medial instances of *ff reconstructed altogether. The other case of initial *ff- actually has a Nakh reflex too, but showing *ħ-! — The three cases of *xxw do not look that much better. NWC has *xw in two cases (and also for the *ff case), secondary *x́w in one, so this at least seems to work. Lak has one case of /xx/, one case of /xxw/ and one case of /šš/; the last supposedly by late palatalization from *xx … but, unfortunately, the one example of /xx/ occurs before /i/? Andic has one case of *xw, one case of *ɬw.

(4) Nakh *pħ rakes together a seemingly respectable 13 examples. But they diverge to nine reconstructions, of which most occur just once: *q *qw *qq *qqw *qʼw *χχw *pʼɦ. The last is a cluster type (obstruent + laryngeal) that seems to be relatively common in the proto-lexicon but is strangely not at all commented on in the Preface. As for the others, only the *qw and *χχw cases seem even expected. For the others the allegedly regular Nakh reflexes are *q > *q, *qq > *q/*ʁ, *qqw > *q/*ʁ, *qʼw > *bʢ. (There is one appeal to labiality metathesis: *qarćʼwV > *qwarćʼV before *qw > *pħ? But this is itself clearly ad hoc rather than regular.)

Our last hope for Nakh *pC are thus the clusters *ɢɢw, *qqʼw; the first represented by four examples (one of them with also a laryngeal: *ɢɢHw), the second by two examples (one of them with a laryngeal). Starting with *qqʼw, and skipping over subfamilies reflecting only one instance, in Tsezic we have one case of *qʼw and one of *qʼ; in Lezgic, one case of *qʼˤw and one of *qʼw (respectively). Inconsistent secondary articulations are not the most major problem maybe, but then the latter etymology additionally requires metathesis from *tʼHalqqʼwV to *qqʼHwaltʼV in Nakh. — Moving to *ɢɢw (when’s the last time you heard of a language that has geminate voiced uvular stops, incidentally?): Tsezic has one *q, one *qw; Dargwa has one *ʁˤw, one *ʁˤ and one *qqw; Lezgic has one *qqˤ, one *qqʼˤ, one *qqʼˤw. One case has a presyllable *mu-, and it would be possible to speculate that actually this is the real source of the Nakh cluster.

(5) Nakh *bʡ is found in an also respectable eleven examples (plus one word-initial one). Three of them are from *ʡw, which ends up reflected reasonably regularly: the reflexes also include two cases of Andic *ħ and one of *ħw, two cases of Tsezic *ħ, three cases of Lak zero, two cases of Dargwa *ħ, two cases of Lezgic *ʔw. A small ray of hope, maybe…

Four cases from *qʼw (three of them with also a laryngeal: *qʼHw) look promising too. But the distribution of these etyma is terrible: only Lak and NWC also reflect more than one of them. The former has one case of *w, one case of *qʼ; the latter has in both cases *qʼ, though the second one with a presyllable *p-, again casting doubt on analyzing Nakh *b as continuing *w.

In the waste pile of protoforms attested only once, we have *ʔw, *hw, *ɦw, *bɦ (with the *hw case showing a presyllable *ba- in NWC).

(6) A Nakh *bʕ appears too. One supposedly from PNC *wH, another two from PNC *bʕ (of which one case “with some metatheses and aberrations”). The latter two do have *pp in Lezgian.

(7) Khinalug /ps/ is found in two examples, one of them indeed *cw and the other *čw. For *cwaʡmV ‘bear’, NWC adds (is supposed to metathesize) a presyllable *mə-; maybe this is once again what’s really going on.

(8) Khinalug /pš/ is found in four examples, going back to *śśw twice, *čw once and also *chw once (I think that’s an alveolar affricate + laryngeal sequence?). Lak has /š/ in both cases of *śśw; NWC has a presyllable *pə- in one of them.

(9) Khinalug /px/ is attested just once; enough said.

(10) Khinalug /pχ/ is attested once word-initially from *χχw as promised, also once word-medially from a sequence *-waχχ-.

So the basic toll is: the Nakh *pC clusters regularly correspond to nothing whatsoever across Northeast Caucasian. Only three of the eight alleged regular sources are actually regular even from PNC to Nakh (“soundlawful regularity”, one of the weakest types). For *bʕ we can find a weak two-example correspondence with Lezgian *pp, for *bʡ one just barely more substantially regular set of correspondences. Khinalug /pš/ finds one two-example correspondence with Lak /š/.

This survey does not fill me with hope for either the current proposals being correct or for the ability to find new, stronger phonological solutions with future work. Probably this is bound to happen to some extent in comparative work between languages with highly complex phonologies. I however wonder now just how much else this result applies to.

Posted in Commentary, Methodology

Secondary apocope in Mordvinic

According to usual understanding, the Proto-Uralic stem vowel contrast *-A | *-ə is still continued in the Mordvinic languages in nominals of the shape CVCV: word-finally *-A survives as a vowel (mostly *-ə, in some cases *-a), while *-ə is lost. This basic rule can be demonstrated easily enough. A particularly clean minimal pair is *pälä ‘half’ | *pälə ‘side’ (still conflated in UEW), with the contrast continued in three branches (and recall that *ä-ä > *a-ə in Finnic):

Branch      | *pälä ‘half’ | *pälə ‘side’
Mordvinic   | *päľä        | *päľ
Mari        | *pelə        | *pel
Finnic      | *pooli       | *peeli

Some more examples, including a second minimal pair *kerä | *kerə:

Original *-A (vowel retained) | Original *-ə (vowel lost)
*kämä > *kämə ‘boot’ | *lämə > *ľäm ‘soup’
*enä > *ińə ‘big’ | *sënə > *san ‘vein, sinew’
*puna > *pona ‘hair’ | *unə > *on ‘sleep’
*kerä > *kiŕə ‘ball of yarn’ | *kerə > *keŕ ‘bast’
*pesä > *pizə ‘nest’ | *kusə > *koz ‘cough’
*ćëta > *śadə ‘100’ | *kätə > *käď ‘hand’

 

There appears to be one general exception to this however. I’ve given above examples after medial nasals, liquids and obstruents. But after medial semivowels, it seems that also original *-A is lost, presumably after reduction to *ə. Three cases with *-jA and one with *-wä are clear enough:

  • Mk. /uj/ ‘brain’ << *ojwa ‘head’
  • Er. /ki/ ‘moth’ << *käjä (via PMo *kij?)
  • *koj ‘custom’ << *kuja
  • Er. /kijov/ ‘snake’ << *kEjə-wä (*-ä per Nenets /śib́ă/ ~ /śiẃă/ < PSmy *kiwä)
    (The other Mordvinic reflexes are more obscure. I’d presume development *kijəv > *kiju > /kju/, and lastly metathesis from this to produce the most widespread form /kuj/.)

A possible case with *-wa is *juv ‘chaff’ << ? *jowa ← Indo-Iranian *yawa-. [1] I suppose though that this could be also reconstructed as having been loaned as *jawə, followed by the early shift *a-ə > *o-a. The general vowel reduction *-a > *-ə also means that the development of secondary *-a is not actually directly evidenced in this stem type, and we could ask if the shift was not rather *a-ə > *o-ə. Still, lack of apocope in at least *ćalə > *ćola > *śulə ‘gut’ would seem to suggest that the stem vowel shift indeed extends to Mordvinic too, not just Samic.

Secondary apocope might moreover take place in *kuj ‘birch’ << ? *kojwa, not found as an independent word but probably continued in *kujmə ‘basket’ and Mk. /kujgeŕ/ ‘birch bark’. There is very little evidence to reconstruct *-a specifically, though; Finnic shows secondary *-u ~ *-o, the isolated Pite Sami word is probably loaned from Finnic, and Samoyedic *koəj (*kojə?) clearly does not indicate *-a. My suspicion is that this has rested only on the Mordvin /u/, given the older theory that *o-ə would generally give Mo. *o. But even starting from *kojwə, I would expect *-ə after an original consonant cluster *-jw- to resist primary apocope, [2] and to be instead lost after cluster simplification in the same wave of secondary apocope that targeted secondary *-ə from *-A.

— On the other hand, since *kuj- only occurs as a cranberry morpheme, maybe I have no reason to speak of apocope: this could be instead syncope, which appears to have operated in Mordvinic slightly more widely than apocope. Cf. e.g. *pizə-nə > *piznə > Er. /pizne/, Mk. /pizna/ as the diminutive of *pizə ‘nest’; *kajwa- → *kajwa-ma >> *kajmə ‘spade’, [3] *wajŋə → *wajŋə-ma >> *vajmə ‘spirit, breath’ as close equivalents to the derivation seen in ‘basket’. (I do not have examples of syncope in first members of compounds readily available, however.)

Unreduced *-a does not seem to have been targeted by secondary syncope, per at least *kuja < ? *koja ‘fat’.

Lastly there seems to exist “tertiary” apocope after /al/. PU *-ala, *-ëla give /-al/ in modern Mordvinic; but Witsen’s late 17th century vocabulary of Moksha still had ‹kala› ‘fish’ for modern /kal/ < *kala, as well as ‹sala› ‘thief’ < *sala (vs. no other unexpected final vowels). This seems regular enough too, though I have no idea what the motivation for such an oddly specific sound change could have been.

Nominals of the shape *CVjə, *CVvə can occur in Mordvinic, but all old native cases seem to come about by the lenition of earlier *p or *k. They probably still had medial obstruents at the time of secondary apocope. (Theoretically also examples with *j, *v from *ŋ or *x might exist.)

  • *kopa > *kobə > *kuvə ‘bark’
  • *śepä > *śebə > *śivə ‘collar’
  • *jekä > *jegə > *ijə ‘year’
  • *śekä > *śegə > *śijə ‘catfish, burbot’
  • *tika > *tugə > *tuvə ‘pig’ (> Er. /tuvo/, Mk. /tuva/)
    (If cognate with Finnic *cika < *tika; I have some doubts about this comparison.)

An early loanword example of this type is *Ravə ‘Volga’ ← Iranian *Rahā, probably loaned as intermediate *Raɣə. (UEW’s comparison with Khanty *răwV ‘mud’ seems far-fetched, and the reconstruction with *ŋ completely unmotivated.)


There are other proposed etymologies too that would seem to show secondary apocope of *-A > *-ə. Several of these however look dubious in various ways:

  • ⁽*⁾oš ‘town’ ? < *woča ‘fence’
    — There are no regular correspondences whatsoever between Mordvinic and PU here! The expected reflex would be **učə. Probably an incorrect etymology; words for ‘town’ can come from ‘fence’ (thus so already in Ob-Ugric), but they don’t have to do so.
  • Mk. /luv/ in /käďluv/ ‘gap between fingers’ ? < *loma ‘gap’
    — Maybe better compared with Finnic *lovi : *lovë- ‘cleft, gap’ (this has been passingly suggested by Aikio) or taken as a semantic specialization of /luv/ ‘number, order, etc.’ < *lukə (thus Grünthal 2012). I wonder if the Mordvins by any chance finger-count by gaps rather than fingers themselves?
  • *čoŋ ‘foam’ ? < *čiŋa
    — Finnic *hiiva ‘yeast’ with unexplained long *ii seems likely to be unrelated (and has a loan etymology from Baltic *šīvas ‘gray’). A proto-form for just Mari and Mordvinic could be rather reconstructed as *šoŋə, or maybe one is simply a loan from the other.
  • *toŋ ‘kernel’ ? < *tuŋa
    — Perhaps better reconstructed as *tuŋə. There is zero other evidence for *ŋ > *m in Finnic, and *tuma ~ *tuuma, if related, could represent a derivative *tuŋ-ma (or even later *tuw-ma with *wm > m explaining the Finnish variant with a short vowel?) The morphology of this would be obscure though, *-ma usually forms only deverbal and locative nouns.

At least one seemingly unimpeachable case remains that I have no explanation for: *ur ‘squirrel’ < *ora(-pa).

[1] Not from front-vocalic *jewä, as reflected in Finnic *jüvä ~ *jivä, but rather a parallel loan. While this difference seems obvious, I think Holopainen 2019 is probably the first major source to state this explicitly?
[2] The closest parallels for this kind of retention of *ə are *veťə ‘5’, *kotə ‘6’, where *ť/t are probably from earlier clusters (not necessarily traditionally assumed *tt, however), and PU *-ə is clearly indicated by Finnic and Samic.
[3] Misglossed in Uralonet as ‘to scoop’; the word is a noun, not a verb.

Posted in Reconstruction

Phonological Cores and Average Regularities

Some thinking out loud on the formalization of comparative and historical phonology.

As in most work I’ve seen on the topic, I presume that an etymological corpus of word comparisons has already been given, additionally also aligned segmentwise. [1] The usual question at this point is how to proceed with reconstruction. I however largely assume even this as given. The main questions I would ask are: how much should we trust a reconstruction given for the data? How coherent is it internally to begin with, and how does it match against other reconstruction possibilities?

This is not a very relevant question for developing automatic reconstruction methods, [2] but better understanding of these issues will be practical in assessing existing proposals. Especially the ones that cover substantial amounts of data but are regardless disputed on every front, e.g. any variant of Altaic or Nostratic.

Foundations and Cores

The basic concepts of this post:

  1. A phonological foundation is a set of word comparisons where every sound correspondence is regular within the set.
  2. A phonological core is a minimal phonological foundation, i.e. a phonological foundation such that no strict subset of its word comparisons is a phonological foundation anymore.

Note that these definitions are only with respect to etyma, not with respect to the number of reflexes. A comparison of two-reflex etyma could be exactly as regular as a comparison of ten-reflex ones; a compact foundation comparing only two languages could be exactly as regular as a diffuse foundation comparing ten languages. For now, all correspondences still have to be regular between all applicable language pairs specifically. [3]

These concepts have been phrased purely in terms of sound correspondences. Actual reconstruction requires consideration as well, though. A few initial definitions for this:

  • A reconstruction is a set of word comparisons between at least three languages, with exactly one of them being a special type of language called a proto-language, with the following properties:
    1. Every comparison includes a proto-language cognate (called a proto-form).
    2. The proto-language is not given by external data, but can be adjusted at will.
      (I.e. this is the “operational” proto-language, not the inferred “real historical” proto-language. By this definition, Latin is not Proto-Romance, at most identical to Proto-Romance.)
  • A historical phonology is a partially ordered set of sound changes (I will not go here into rigorously defining a sound change) with the following properties:
    1. Sound changes are ordered with respect to one another only if they interact ≈ roughly: take the same segment as input or as conditioning.
      (I.e. we abstract away the difference between historical phonologies that differ only in the relative chronology of changes that do not interact. In the absence of other details, *śëta > *śata > sata and *śëta > *sëta > sata should be considered identical histories.)
    2. The bottommost sound changes start from the proto-language.
    3. The topmost sound changes yield the other languages as recorded in real data.
    4. For any sound change applying to all languages, there is at least one sound change postdating it that does not apply to all languages.
      (I.e. the proto-language is still indeed the last common ancestor, not merely any common ancestor.)

The latter could be better called a “comparative historical phonology”… since real historical phonologies often take additionally also loanword evidence into account when establishing relative chronologies. And we could also define internally reconstructed historical phonologies that replace condition 4 with a redefinition of the proto-language. Gotta learn to walk before running, though.
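The point about non-interacting changes can be checked mechanically. The following is a minimal sketch, not a rigorous model of sound change: each change is reduced to an unconditioned string replacement, and the helper names (`make_change`, `apply_in_order`) are my own. It only shows that the two orderings of the *śëta example yield the same output.

```python
from functools import reduce

def make_change(old: str, new: str):
    """An unconditioned sound change, modeled as plain replacement."""
    return lambda word: word.replace(old, new)

def apply_in_order(word: str, changes) -> str:
    """Run a word through a sequence of sound changes, first to last."""
    return reduce(lambda w, change: change(w), changes, word)

vowel_change = make_change("ë", "a")      # *ë > a
sibilant_change = make_change("ś", "s")   # *ś > s

# The two changes share neither input nor conditioning,
# so their relative order does not affect the outcome.
history_a = apply_in_order("śëta", [vowel_change, sibilant_change])
history_b = apply_in_order("śëta", [sibilant_change, vowel_change])
print(history_a, history_b, history_a == history_b)
```

Under this view a historical phonology only needs to record an ordering for those pairs of changes where swapping them would change some output.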

I have on purpose defined these two concepts without referencing the concepts in my first list. It is more profitable to instead treat these as orthogonal, and to speak of concepts such as a foundational reconstruction = a reconstruction whose underlying set of word comparisons, the proto-language excluded, is a phonological foundation. There are many proposed reconstructions, and we do not want to suggest that they are by definition regular in my highly formal sense! As seen below, perhaps they do not even need to be.


Some concepts in hand, let us now go over a simple example. One clean phonological core within Uralic is presented by the following four etymologies between Finnish, Northern Sami and Erzya:

  • Fi. kesä ~ NS geassi ~ Er. кизэ /kize/ ‘summer’
  • Fi. pesä ~ NS beassi ~ Er. пизэ /pize/ ‘nest’
  • Fi. kala ~ NS guolli ~ Er. кал /kal/ ‘fish’
  • Fi. pala ~ NS buolli ~ Er. пал /pal/ ‘bit’

(I have stuck here to the most widely spoken members of their subfamilies. The data could be easily also stretched to include further varieties, or rewritten as a comparison of Proto-Finnic, Proto-Samic and Proto-Mordvinic.)

We can easily see that everything is regular: every sound correspondence occurs at least twice — in fact exactly twice; cores with correspondences occurring thrice could only be put together from more holey data. Reconstructions would be easy to suggest too. A phonetically simple approach would be e.g. *kesä, *pesä, *kală and *pală, which is only mildly off from the usual thinking. [4]
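The check can also be done mechanically. The sketch below rests on several assumptions of mine: the segmentwise alignment of the four etymologies, the threshold “regular = attested in at least two comparisons”, and the simplification of treating each three-language correspondence as one unit instead of testing every language pair separately. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Aligned correspondences as (Finnish, Northern Sami, Erzya); "" = zero reflex.
SUMMER = (("k", "g", "k"), ("e", "ea", "i"), ("s", "ss", "z"), ("ä", "i", "e"))
NEST   = (("p", "b", "p"), ("e", "ea", "i"), ("s", "ss", "z"), ("ä", "i", "e"))
FISH   = (("k", "g", "k"), ("a", "uo", "a"), ("l", "ll", "l"), ("a", "i", ""))
BIT    = (("p", "b", "p"), ("a", "uo", "a"), ("l", "ll", "l"), ("a", "i", ""))

def is_foundation(comparisons) -> bool:
    """Every correspondence must occur in at least two comparisons."""
    counts = Counter(corr for comp in comparisons for corr in set(comp))
    return len(comparisons) > 0 and all(n >= 2 for n in counts.values())

def is_core(comparisons) -> bool:
    """A foundation none of whose non-empty strict subsets is a foundation."""
    comparisons = list(comparisons)
    if not is_foundation(comparisons):
        return False
    smaller = (sub for size in range(1, len(comparisons))
               for sub in combinations(comparisons, size))
    return not any(is_foundation(sub) for sub in smaller)

print(is_core([SUMMER, NEST, FISH, BIT]))
print(is_foundation([SUMMER, NEST]))
```

The subset search is exponential, which is harmless for four comparisons but would need a smarter approach for realistic datasets.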

However, the data here in effect only allows reconstructing two onsets *k-, *p- and two rimes that I have just called *-esä, *-ală. It does not establish any contrast between the individual segments in the rimes! This means that given just this data, we could also rewrite the rimes in more minimal forms such as *-ele, *-ale, and assume a number of conditional sound changes that apply in all or most descendants (e.g. *l > ⁽*⁾s / e_ in all, *e > a / aC_ in Finnish).

This hence already demonstrates that reconstructions should not be built up from “core reconstructions”: overly limited data leads to overly minimal reconstructions. A four-comparison core is not quite the smallest possible, [5] but obviously most of a realistic proto-language still cannot fit into one. Reconstructions with real phonological labels should probably wait until we have assembled larger phonological foundations — within cores, this work is adequately substituted just by the correspondence patterns themselves.

This phonological core is incidentally also a “semantic core”, with each of the four comparisons showing the exact same meaning in every language. This is probably also a desirable trait in phonological foundations in general, but then not strictly required by the phonological formal side of the Comparative Method.

Comparison Regularity

Using the concepts of phonological foundations and cores, I can now also define a few categories of word comparisons:

  1. A word comparison that belongs to at least one phonological foundation is regular.
    • A core comparison is a word comparison that belongs to at least one phonological core.
    • A regular adduct is a word comparison that belongs to at least one phonological foundation, but does not belong in any phonological core (is not a core comparison).
  2. A word comparison that shows regular sound correspondences as established by a phonological foundation, except for one unique sound correspondence, is near-regular.
    • Given a reconstruction, a single near-regular comparison that does not contradict any soundlaw establishable from the reconstruction without this item is (phonologically) nonprovable; as is the new, once-exemplified soundlaw it requires.
    • Given a reconstruction, a near-regular word comparison that does contradict a soundlaw establishable from the reconstruction without this item (would necessitate setting up a new proto-phoneme) is an exception.
    • If there are two or more nonprovable comparisons, such that they are compatible with the same foundation, but only one of them can be added to the foundation as nonprovable (forcing the others to be exceptions), they are competing.
  3. A word comparison more irregular than a near-regular one (with at least two correspondences that are not regular) is simply irregular. We could distinguish further categories such as “2-irregular”, “3-irregular” etc. (with “1-irregular” being what I have just titled “near-regular”), but in practice the case seems to be that, for any sensible morpheme length, already 2-irregular correspondences are too weak to be very useful for linguistic reconstruction at all.
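As a rough illustration of these grades, a candidate comparison can be scored by counting how many of its correspondences are not yet attested in a given foundation. This is a simplified sketch under my own assumptions: it uses only Finnish and Erzya, relies on my own segmentwise alignments, and does not model the reconstruction-dependent split of near-regular items into nonprovable ones and exceptions.

```python
# Finnish ~ Erzya correspondences; "" marks a zero reflex (alignment assumed).
FOUNDATION = [
    (("k", "k"), ("e", "i"), ("s", "z"), ("ä", "e")),   # kesä ~ kize 'summer'
    (("p", "p"), ("e", "i"), ("s", "z"), ("ä", "e")),   # pesä ~ pize 'nest'
    (("k", "k"), ("a", "a"), ("l", "l"), ("a", "")),    # kala ~ kal 'fish'
    (("p", "p"), ("a", "a"), ("l", "l"), ("a", "")),    # pala ~ pal 'bit'
]

def unattested(candidate, foundation):
    """Correspondences of the candidate that the foundation does not contain."""
    known = {corr for comp in foundation for corr in comp}
    return [corr for corr in candidate if corr not in known]

def classify(candidate, foundation) -> str:
    """Grade a comparison relative to one particular foundation."""
    n = len(unattested(candidate, foundation))
    if n == 0:
        return "regular"
    if n == 1:
        return "near-regular"
    return f"{n}-irregular"

VALA = (("v", "v"), ("a", "a"), ("l", "l"), ("a", ""))   # vala ~ val
VESI = (("v", "v"), ("e", "e"), ("s", "ď"), ("i", ""))   # vesi ~ veď

print(classify(VALA, FOUNDATION))
print(classify(VESI, FOUNDATION))
```

The grade is always relative to the foundation supplied: vesi ~ veď looks hopeless against this four-word set alone, even though a larger foundation can establish every one of its correspondences.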

The first and third points of case #2 may sound confusing; in practice it means simply the case where we do not have enough data to establish what the regular reflex of a proto-phoneme *X in language L might be.

Note that a comparison may be at one stage with respect to one phonological foundation, at another with respect to another.


Continuing the previous example, my above-detailed core motivates hunting for other words displaying the same correspondences: Fi. p- ~ NS b- ~ Er. п-, Fi. -a- ~ NS -uo- ~ Er. -а-, etc. The current core is “closed” in the sense that adducing any one additional and different comparison from among the known (West) Uralic comparative material cannot produce a new foundation. Any new item will have to remain nonprovable until at least one other item has also been added. Purely in theory, it could accept a comparison such as Fi. kala ~ NS gilluo, mixing sound correspondences from different “slots”, which would perhaps prompt reconstructing something like *kålä for ‘fish’, *kälå for this. However, across the Uralic languages it happens to be the case that every complete sound correspondence (a correspondence pattern; see below) is strictly restricted to a particular position in the word. [6]

To pick out one new datapoint: Fi. vala ‘oath’ ~ Er. вал /val/ ‘word’ (with also Sami cognates, not found in NS though) can be seen to be regular save for the initial, i.e. it is nonprovable with respect to this core. Minimally it could be proven to be regular by also identifying a West Uralic **vesä. No such comparison is known though (and indeed no words ˣvesä, ˣveassi, ˣвизэ exist in any meaning at all in the three languages); hence more data still is required. One way to do it would be to adduce also the following two comparisons:

  • Fi. kesi ~ Er. кедь /keď/ ‘skin’
  • Fi. vesi ~ Er. ведь /veď/ ‘water’

These two allow reconstructing a third rime *-eti; vala ~ вал and vesi ~ ведь allow reconstructing a third onset *v-; and the rime *-ală and the onset *k- we knew already. Hence all is again in order. Notice that by now we have not only proven vala ~ вал to be regular: it is indeed as much as a core comparison, since also the set *kală, *vală, *keti, *veti constitutes a core!

As noted above, this new second core also only works between Finnish and Erzya. In any Sami variety, clear cognates exist only for *vală and *keti. They will remain nonprovable until we have adduced even more evidence to establish the regular Samic development of *-eti rimes, or more generally *-eCi rimes, and of *v-. However, forms like Southern Sami vuelie /vʉelie/ ‘a joik’ or Skolt Sami -kõtt ‘skin’ regardless already suggest what to look for. This evidence is not hard to locate either, e.g. in Fi. veri ~ NS varra (SS vïrre, SkS võrr) ~ Er. верь /veŕ/ ‘blood’. Though this in turn will send us on a lookaround to establish a few other things as regular, most prominently the development of *r in all involved languages and of *t in Sami; secondly also to verify the correspondences v- ~ v- ~ v- and ï-e ~ a-a ~ õ-∅ between our three Sami varieties that have come up so far. All doable too of course. What we can however see already well enough is how extending foundations is not a question of linear progress.

Segment Regularity

A methodological problem that emerges here, once variable amounts of languages per etymology are being compared, is that regularity for every pairwise comparison may be too much to demand. If e.g. between Finnish and Hungarian there is insufficient evidence to establish a correspondence such as s ~ gy as regular at all (the only example of this is ‘urine’: kusi ~ húgy < PU *kuńćə), but at the same time s ~ /ź/ and gy ~ /ź/ can be both established in comparison with Komi (or s ~ /ńś/ and gy ~ /ńś/ in comparison with Mansi, etc.) — is this not good enough? It feels to me that we should not have to choose between either Finnish, Hungarian, or the still quite regular ‘urine’ etymon to include in a good phonological foundation of Uralic.

Lone pairwise comparison is not good enough for everything, on the other hand. This would make it much too easy to set up some “straggling” members in datasets, such that Chuvash maybe has regular correspondences with Mari but not with the rest of Uralic.

Any pairwise segment comparison still only either is or isn’t regular, and I’ve already defined grades of regularity for an individual pairwise word comparison as well. Even further grades of regularity can regardless be defined, first for individual segments as considered across the entire dataset:

  1. Complete regularity: a segment whose every pairwise correspondence across a foundation is regular.
  2. Biconnected regularity: a segment whose graph of pairwise regular correspondences across a foundation is a biconnected graph (cannot be split into two independent graphs by the removal of one “bridging” language from comparison).
  3. Connected regularity: a segment whose graph of pairwise regular correspondences across a foundation is a connected graph.
  4. Soundlawful regularity: a segment whose every correspondence with the proto-language (hence in a reconstruction, not just any foundation) is regular.
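
The first three grades can be checked mechanically once the regular pairwise correspondences of a segment have been listed. What follows is only a minimal sketch in Python, under the assumption that each language is a vertex and each regular pairwise correspondence of the segment is an edge; the function names are my own illustration, not established terminology. The example is the ‘urine’ case from above, where s ~ gy is not established between Finnish and Hungarian but both correspond regularly to Komi /ź/.

```python
from itertools import combinations

def is_connected(languages, edges):
    """True if every language can be reached from every other one."""
    languages = set(languages)
    if not languages:
        return True
    neighbours = {lang: set() for lang in languages}
    for a, b in edges:
        # edges touching a language outside the set are ignored
        if a in languages and b in languages:
            neighbours[a].add(b)
            neighbours[b].add(a)
    start = next(iter(languages))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == languages

def is_biconnected(languages, edges):
    """True if no single "bridging" language can be removed to split the graph."""
    languages = set(languages)
    if not is_connected(languages, edges):
        return False
    if len(languages) < 3:
        return True  # nothing can act as a bridge between fewer than three languages
    return all(is_connected(languages - {lang}, edges) for lang in languages)

def is_complete(languages, edges):
    """True if every pair of languages shows a regular correspondence."""
    known = {frozenset(edge) for edge in edges}
    return all(frozenset(pair) in known for pair in combinations(set(languages), 2))

def grade(languages, edges):
    """Return the strongest of grades #1 to #3 that holds for this segment."""
    if is_complete(languages, edges):
        return "complete"
    if is_biconnected(languages, edges):
        return "biconnected"
    if is_connected(languages, edges):
        return "connected"
    return "not connected"

languages = ["Finnish", "Hungarian", "Komi"]
urine = [("Finnish", "Komi"), ("Hungarian", "Komi")]
print(grade(languages, urine))  # connected: Komi is the bridging language
```

Soundlawful regularity (#4) is deliberately left out, since it needs proto-forms and not just a graph of attested languages.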

Anything less than #4 is obviously no longer regular at all, but at most semiregular: since we can attempt to provide a proto-form for every word comparison, gaps cannot be a problem for soundlawfulness.

#4 is, in fact, weaker than even #3. Assume that we had only two examples of the development of *k > /k/ in a poorly known (or just heavily divergent) language such as Muromian. This could arguably suffice to establish the reflex as regular. But then if these two etyma had no overlap in what other languages they have reflexes in, they would not establish any regular correspondence between Muromian /k/ and any other attested Uralic language. I believe it is even possible to create, between no more than three languages, a highly degenerate counterexample dataset that is soundlawfully regular but none of the pairwise sound correspondences are.

Another way to create a highly degenerate but soundlawfully regular dataset is to simply pool together two disjoint foundations — say, with data comparing Mari and Chuvash as one component, data comparing French and English as another. This would still suffice to show that *k- > /k/ is a regular sound change in each language (just not that it is the same *k in all cases…). This is clearly absurd as one proto-language though. I suppose global connectedness regardless of regularity should be required anyway.

In large datasets further grades between #1 and #2 could also prove useful. I do not have the intuition to immediately identify them, though (“triconnected” comes to mind as a naive proposal, but whether it would really improve anything much is not clear to me [7]). Before that though, we can consider what are sensible options for datasets with only a small number of languages. For two languages, soundlawful regularity already equals complete regularity. For three, biconnectedness does the same. For four, a double triangle graph is a possible option that is more than biconnected but not fully connected. But then it’s also already vulnerable to stragglers: e.g. we can find regular correspondences between Swedish, Finnish, Karelian and Erzya, just not between Swedish and Erzya specifically. More Finnic languages could also be added into the mix to create even more highly connected correspondence graphs that still have the same problem. To eliminate this problem, but not the case of *ńć in Finnish vs. Hungarian, perhaps it suffices to demand the existence of some regular correspondences between all pairs of languages (even if not all pairs of segments).


Did you notice an assumption that I have snuck in unsaid above? It is that pairwise segment correspondences could be linked together into single distinct graphs of a segment’s correspondences. Actually though, this is not trivial, and already constitutes some basic work towards a reconstruction. I can define a few concepts related to this too, while I’m at it. It also gives a first dip into the topic of conditional sound changes and conditional sound correspondences.

  • Given a set of word comparisons covering at least three languages, and with at least some word comparisons not covering all languages, a correspondence pattern is a grouping of pairwise sound correspondences that assigns a reflex for every language and assigns the pairwise sound correspondences of one multi-language word comparison into the same group.
    • A correspondence pattern is fully attested if every language appears at least once within it.
    • A correspondence pattern is complete if every one of its pairwise sound correspondences occurs in at least some word comparison.
    • (I could again define also biconnected, connected etc. correspondence patterns as weaker options, but I am not sure if this is necessary.)
    • A correspondence pattern is well-supported if there exists a word comparison that displays every member of the correspondence pattern. We could also call this “1-supported”, and define “n-supported” as the minimum number of word comparisons that displays every member.
  • A pre-reconstruction is, in turn, a grouping of either binary sound correspondences (if between two languages) or correspondence patterns (if between more than two languages) by positional environments. This could be further split into a few subtypes too, e.g. conditioning by daughter-language phonetics or conditioning by proto-language phonetics. Already a single correspondence pattern, though, could also constitute a (fairly trivial) pre-reconstruction. — It should probably be demanded that a pre-reconstruction only unites correspondence patterns that have some overlap in their reflexes, not arbitrarily different ones.
  • An unlabeled reconstruction is a set of pre-reconstructions that covers every sound correspondence within a comparative corpus. (Two-language comparisons could be always trivially considered to be unlabeled reconstructions.)
    • An unlabeled reconstruction is fully reflected if every pre-reconstruction contains a fully attested correspondence pattern. (In the case of highly split correspondences, we might not want to demand this of every minor correspondence pattern.)
    • Likewise, an unlabeled reconstruction is well-supported if every pre-reconstruction contains a well-supported correspondence pattern; etc.
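
The three properties of a correspondence pattern can be spelled out as small predicates. This is a sketch only, assuming a hypothetical encoding where a pattern maps each language to its reflex, and each word comparison maps the languages it covers to the reflex it shows in the relevant slot. The language labels and function names are illustrative.

```python
from itertools import combinations

def attestations_of(pattern, comparisons):
    """Keep the word comparisons whose reflexes all agree with the pattern."""
    return [word for word in comparisons
            if word and all(pattern.get(lang) == reflex for lang, reflex in word.items())]

def is_fully_attested(pattern, comparisons):
    """Every language of the pattern appears in at least one word comparison."""
    words = attestations_of(pattern, comparisons)
    return all(any(lang in word for word in words) for lang in pattern)

def is_complete(pattern, comparisons):
    """Every pairwise correspondence of the pattern occurs in some word comparison."""
    words = attestations_of(pattern, comparisons)
    return all(any(a in word and b in word for word in words)
               for a, b in combinations(pattern, 2))

def is_well_supported(pattern, comparisons):
    """Some single word comparison displays every member of the pattern."""
    words = attestations_of(pattern, comparisons)
    return any(all(lang in word for lang in pattern) for word in words)

pattern = {"Fi": "v", "NS": "v", "Er": "в"}
vesi = {"Fi": "v", "Er": "в"}             # vesi ~ ведь: Finnish and Erzya only
veri = {"Fi": "v", "NS": "v", "Er": "в"}  # veri ~ varra ~ верь: all three
fi_ns_only = {"Fi": "v", "NS": "v"}       # hypothetical comparison lacking an Erzya member

print(is_well_supported(pattern, [vesi, veri]))        # True
print(is_fully_attested(pattern, [vesi, fi_ns_only]))  # True
print(is_complete(pattern, [vesi, fi_ns_only]))        # False: no NS ~ Er attestation
```

The last case shows the difference between the first two properties: every language is present somewhere, but the NS ~ Er correspondence is only implied by the pattern, not observed.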

Note that while an unlabeled reconstruction covers the entire system of correspondences in a given corpus of word comparisons, pre-reconstructions are segmentwise, per one alignment “slot” at a time (and this could maybe use a better term; “unlabeled segment reconstruction” doesn’t strike me as progress though). Also, as we already established in the previous section, correspondence patterns cannot be simply classified as “regular” or “not regular”. They are “once-soundlawful” by definition, but not anything more.


Continuing to work with the example from above, Fi. v ~ NS v and Fi. v ~ Er. в are sound correspondences; Fi. v ~ NS v ~ Er. в is a correspondence pattern that combines them, and moreover suggests also the existence of a third sound correspondence, NS v ~ Er. в. Once we observe that this correspondence pattern occurs exclusively word-initially, it can be combined with also a corresponding word-medial correspondence pattern (Fi. v ~ NS vv ~ Er. в) into a pre-reconstruction: Fi. v ~ NS v-/-vv- ~ Er. в.

There would be other options, e.g. to combine the word-initial pattern with a different medial correspondence pattern that it also overlaps with: Fi. v ~ NS ~ Er. в. Note that we usually choose the first option (and say that they reflect Proto-Uralic *w, while the second reflects PU *ŋ) primarily due to the greater phonetic similarity. It would be entirely possible to shift them around in the reconstruction, to claim that PU *w nasalises to *ŋ in Samic (etc.), but that there was also a segment *hʷ that occurs only medially and always lenites to *w or similar. This is in all respects exactly as regular as the usual reconstruction with *w and *ŋ; it only does worse in terms of how natural the required sound changes are.

It’s also possible to advance an objection against the demand for sound correspondences to overlap before they can be combined in the same pre-reconstruction, even if it is clear that often combining non-overlapping sound correspondences would create nonsensical pre-reconstructions. Suppose a proto-language had some prominent allophonic distribution, e.g. between word-initial voiceless *[t] and word-medial voiced *[d]; but, independently in all descendants (e.g. perhaps due to later sound changes such as *st > /t/ or *ð > /d/), /t/ and /d/ have become different phonemes. Then, even if we take phonological and not phonetic data as our input, word-initial t ~ t ~ t and word-medial d ~ d ~ d will end up being two different correspondence patterns with no overlap between them.

Is this a problem? Not necessarily. It seems to me that unifying these as the same proto-phoneme is not a task of reconstruction: it is a task of the phonological analysis of the proto-language. That is, we see that reconstruction outputs “allophonemes”, not phonemes, possibly even in the case where the input is phonemic data. Due to this, it can be often a good idea to also not use phonemic but rather similarly “allophonemic” input data. Suppose now a case where medial voicing of stops has remained purely allophonic in all descendants of a proto-language: in this case, the allophony rule could be still reconstructed, but only if we do not first eliminate it from the data by collapsing [t-] and [-d-] into /t/.

(Despite these examples being simplistic, it is also the case that identifying nonphonological contrasts in a reconstruction is often not trivial. One of the more surprising adjustments to Uralic historical phonology over the last few decades has after all been the result that, while traditionally reconstructed *oo and *ee do seem to contrast with *o and *e, they do not contrast with *a and *ä, and in fact have massive overlap with them in their reflexes, even if the proposed phonological proto-values are quite different. The solution to this has also not been to set up [a ~ oo] and [ä ~ ee] in an original allophonic relationship, it has been to recognize *oo and *ee as later innovations exclusive to Finnic.)

Lastly, a few statistical measures of pre-reconstructions that I can think of, which might come in useful eventually.

  • The multiplicity of the pre-reconstruction is the number of correspondence patterns it encompasses.
  • The split count = S of the pre-reconstruction is the number of phonological splits it tracks. If a language shows N different reflexes across a pre-reconstruction, the split count of this language for this segment is N-1; the total split count is then the sum of these across the dataset.
  • The expected multiplicity is 2^S. The real multiplicity can be often smaller, though, both due to similar conditioning in several languages, and due to gaps in the data where two conditioning factors by accident do not occur in any word (even if this would be theoretically possible). Some general positional considerations could be applied to calculate a better expected value.
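
These three measures are simple enough to compute directly. A minimal sketch, assuming a hypothetical encoding of a pre-reconstruction as a list of correspondence patterns (each a mapping from language to reflex); the function names are illustrative, and the expected value follows the 2^S formula exactly as stated above, without any positional refinement.

```python
def multiplicity(patterns):
    """Number of distinct correspondence patterns in the pre-reconstruction."""
    return len({tuple(sorted(p.items())) for p in patterns})

def split_count(patterns):
    """Sum over languages of (number of distinct reflexes - 1)."""
    languages = {lang for p in patterns for lang in p}
    return sum(len({p[lang] for p in patterns if lang in p}) - 1
               for lang in languages)

def expected_multiplicity(patterns):
    """Expected multiplicity, taken as 2^S as in the text."""
    return 2 ** split_count(patterns)

# The pre-reconstruction Fi. v ~ NS v-/-vv- ~ Er. в from above
w = [
    {"Fi": "v", "NS": "v", "Er": "в"},   # word-initial pattern
    {"Fi": "v", "NS": "vv", "Er": "в"},  # word-medial pattern
]
print(multiplicity(w), split_count(w), expected_multiplicity(w))  # 2 1 2
```

Here only Northern Sami shows a split, so the real and the expected multiplicity coincide.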

Higher Regularity

Continuing on. Before we start adding too many near-regularities and irregularities on top of phonological foundations, it is worthwhile to consider how far we might be able to get with just them.

A naive guess could be that the best phonological foundation for some large family like Uralic consists simply of gathering as many phonological cores as possible, taking their union, and topping up with any regular adduct comparisons that fit into this system. I think this is probably a bad idea, though. I’ve bordered above on the problem that it is often possible to identify phonological cores that consist of loanwords. These can be not just loanwords to/from an outside source; they can be also inside a family, creating false correspondences. There might be also some small number of accidental cores out there, even. E.g. nursery words of the mama papa dada type will easily allow establishing a regular correspondence a ~ a between almost any languages in the world, and it would only take a few coincidences to end up being able to show their consonant correspondences to be regular too.

As established, one way to weed these out will be examining the big picture of the sound correspondences and demanding biconnected etc. regularity (essentially an argument from distribution). Another clear source of false positives though is that so far I have not been very strict in defining “regularity” to begin with: I’ve accepted mere recurrence of any kind as sufficient. Normally, two examples of a sound correspondence are actually only very feeble evidence!

My assumptions, previously unspoken, have been the following:

  • If a linguistic relationship is real, then most sound correspondences will recur, over and over, within and between different cores, and build up naturally in this way once we start considering larger foundations.
  • Sound correspondences come in an exponentially decaying longish-tail distribution: while some will end up recurring quite abundantly, most don’t.

The second is particularly because of conditional splits, which will divide any proto-segment across multiple correspondence patterns. Between all three of Finnish, Northern Sami and Erzya, there are some 40–50 examples known of the word-initial sound correspondence k ~ g ~ к, some 20–30 for the next most abundant examples like word-initial p ~ b ~ п (and it is not coincidental that neither of these consonants has been affected by any further conditional sound changes in any of the three languages); but for the most poorly attested regular correspondences, we indeed have to make do with just two examples between just two languages, before fading into correspondences that are regular only when routed through some additional language, or regular soundlawfully but not by binary comparison, or only semi-regular, or irregular entirely.

Could we just require every pairwise sound correspondence to occur at least thrice, and then work with “3-cores” and “3-foundations” as the most reliable key evidence? This is probably possible between some closely related languages. I am however uncertain if there would exist any of these for wider Uralic at all. There definitely are not any neat and compact nine-item cores that look like *pala *pola *pula | *tala *tola *tula | *kala *kola *kula (analogous in structure to my four-item cores covered above). This is for two reasons: (1) given the “long-tailedness” of pairwise sound correspondences, it is unlikely to find many high-frequency correspondences co-occurring in a word comparison; (2) in Uralic in particular, word roots/stems are relatively long, 4–5 segments, which makes it even harder to find a word comparison that avoids all the rare-but-regular sound correspondences.

Maybe some other condition needs to be relaxed at the same time? E.g. counting things on the pre-reconstruction level instead. After we’ve identified a complementary distribution e.g. among the different Samic cognates of Finnish /v/, we could then recognize Fi. v ~ NS v- and Fi. v ~ NS -vv- as the same meta-correspondence, and so forth… But this actually already pares things back to the level of mere soundlawful regularity: all soundlaws affecting some proto-segment are already encoded within the correspondence bundles of a pre-reconstruction, and only a phonetic label for the proto-segment is missing. And demanding more than two examples of a reflex is not too hard at all.

A better option is perhaps to instead use the fact that words are inherited as a whole. If a word comparison shows three highly recurring correspondence patterns and one more poorly attested but still regular one, the three first should also allow us to put more trust in the fourth not being accidental. We could even calculate the average regularity. To avoid high-frequency correspondences “covering for” too many low-frequency ones, though, this should also probably be the geometric mean, not the usual arithmetic mean.
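
To see why the choice of mean matters, here is a small sketch comparing the two for a single word comparison. The counts are hypothetical (three well-attested correspondences and one attested only twice), and the variable names are mine.

```python
from statistics import geometric_mean, mean

# Hypothetical number of examples for each of the four sound
# correspondences displayed by one word comparison
counts = [45, 25, 20, 2]

arithmetic = mean(counts)           # dominated by the high-frequency correspondences
geometric = geometric_mean(counts)  # pulled down much more by the single rare one

print(arithmetic)
print(round(geometric, 2))
```

The arithmetic mean stays at 23 no matter how the examples are distributed across the four correspondences, while the geometric mean drops as soon as any one of them is rare, which is the behaviour wanted here.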


It’s even possible to propose that wordwise average regularity (let’s abbreviate this to WWAR) should to some extent trump segmentwise regularity altogether. Consider again some case like Fi. kusi and Hu. húgy that is not perfectly provably regular. That we still “want to” relate them can be after all motivated also without reference to the other Finno-Ugric languages, or to any detailed semantic considerations, by how k- ~ h- is a highly regular correspondence. So is -i ~ ∅, though this is a bit too “morphological” to fully count. [8] u ~ ú is also attestable, if rarer [9] and without well-known conditioning factors.

Besides giving a natural way to incorporate nonprovable and exceptional correspondences into an “extended phonological foundation”, WWAR is a measure that has also a few further good features. For one it largely captures the fact that short CV and CVC comparisons are more vulnerable to chance resemblances. For two, inversely it allows putting a bit more trust in word comparisons involving consonant clusters, which often show some highly conditional sound changes ⇒ not highly regular correspondences. In a comparison like Fi. täysi : täyte- ‘full’ ~ Hu. tel- ‘to be full’ we can then rely on as many as four highly or at least reasonably regular correspondences (t ~ t, ä ~ e, s : t ~ l, i : e ~ ∅) and not have to worry about y ~ ∅ too much.

But I think that it is still also necessary to start with solidly regular foundations, since the frequency of a sound correspondence depends on the corpus of word comparisons. Adding kusi ~ húgy to a corpus of Finnish–Hungarian word comparisons, from the sound correspondence point of view, does not only add the one case of s ~ gy, it increases by one also the counts of the other three correspondences, i.e. makes them more regular still. This being the case, there could be a risk of “farming” some highly recurring correspondences from numerous exceptional or nonprovable word comparisons, and using these as the main workhorse carrying the reconstruction. This was one problem in Uralistics in the 19th and early 20th century: a good understanding of some of the stronger correspondences had been worked out (in particular among consonants), which were relied on to accept also all kinds of poorer correspondences (in particular among vowels). Similar examples occur elsewhere in the history of etymology too, I’m sure (insert cliché allegedly-Voltaire quote here).

Given a corpus of word comparisons that is known to include some crud, there should actually exist a sweet spot of a sort. Calculate the WWAR and also the average WWAR across the corpus; then prune the lowest-WWAR comparison(s) and see what happens to the average WWAR. Eliminating highly irregular crud should raise this metric. But also pruning everything down to just a single phonological core would leave the average WWAR at no more than 2. Somewhere, then, there will be a maximum average WWAR that will be in a sense the most regular sub-corpus that can be achieved. There can be multiple local maxima though (there definitely are “between” cores, again as per my above example with vala ~ вал), and I’d have to work through a larger example corpus in detail to see.
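
The pruning procedure can be sketched as follows. This is a deliberately naive greedy version, assuming a hypothetical encoding where each word comparison is a tuple of correspondence labels, the regularity of a correspondence is simply its number of occurrences in the current corpus, and WWAR is the geometric mean of these. It removes one comparison at a time and so can only find one maximum, not all local maxima. The toy corpus is the four-item core *kală, *vală, *keti, *veti (with onsets and rimes as units) plus one invented piece of crud.

```python
from collections import Counter
from statistics import geometric_mean, mean

def wwar_scores(corpus):
    """Geometric-mean regularity of each word comparison in the corpus."""
    counts = Counter(corr for word in corpus for corr in word)
    return [geometric_mean([counts[corr] for corr in word]) for word in corpus]

def prune_to_best(corpus):
    """Greedily drop the lowest-WWAR comparison; keep the best sub-corpus seen."""
    current = list(corpus)
    best, best_avg = list(current), mean(wwar_scores(current))
    while len(current) > 1:
        scores = wwar_scores(current)
        worst = min(range(len(current)), key=scores.__getitem__)
        del current[worst]
        avg = mean(wwar_scores(current))  # counts are recomputed after each removal
        if avg > best_avg:
            best, best_avg = list(current), avg
    return best, best_avg

core = [("k-", "-ală"), ("v-", "-ală"), ("k-", "-eti"), ("v-", "-eti")]
crud = ("x-", "-yz")  # hypothetical comparison with two unparalleled correspondences

best, best_avg = prune_to_best(core + [crud])
print(best, round(best_avg, 3))
```

On this toy corpus the procedure discards only the crud and stops improving at the bare core, whose average WWAR is 2, in line with the ceiling noted above for a single core.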

Defining WWAR for non-binary comparisons will be something to figure out later also. Would just covering all the pairwise correspondences work? Perhaps it does. E.g. we can note that the number of pairwise correspondences grows quadratically as the number of independent members in an etymological comparison increases, and so this metric would naturally capture the intuitive impression that widespread etymologies are stronger (increase average WWAR more) than narrowly spread ones are.

Etymological Leftovers

I can think of one further potential problem in approaching reconstruction primarily as collecting phonological cores. A particular etymology could be quite regular between a handful of languages, but not between others. Maybe some further cognate shows unexplained quirks, or in some language group there exists a proposed but very dubious maybe-cognate. This is very common across Uralic, probably in any deeper and wider language family really. How worried should we be if these cognates turn out to not fit into phonological foundations?

For a demonstration of the issue, a few examples from what I generally consider AAA-class Uralic vocabulary overall:

  • *ëla- ‘under’: Mansi shows *jal- instead of expected **ëël-.
  • *elä- ‘to live’: Mordvinic has for this meaning *eŕa-, irregular on every segment but phonetically fairly close regardless.
  • *enä ‘big’: Komi has /una/ ‘many’ instead of expected ˣ/on-/. Udmurt /una/ is in principle regular, but per Komi this may have been irregular *una and not **ona already in Proto-Permic.
  • *ďëmə ‘bird cherry’: Erzya shows /lʲom/ with unexpected /o/ and unexpected initial palatalization, Moksha shows /lajmä/ with unexpected intrusive -j-, and even a common Proto-Mordvinic form does not seem to be readily reconstructible.
  • *ipsə ‘smell’: Hungarian has íz with irregular /z/ (maybe nonprovable as a reflex of *ps in particular). Moksha has /opəś/, irregular on every segment except *p.
  • *jäŋə ‘ice’: Permic has *jë, with unexplained loss of *ŋ (which has parallels though, so technically regular) and an irregular vowel.
  • *jëxə- ‘to drink’: the labial vowel in Samic *jukë-, Finnic *joo- is not really expected and has no exact parallels.
  • *kajwa- ‘to dig’: Samic has *koajvō- instead of expected **kuojvē- or **kuojvō-.
  • *kälä- ‘to wade’: Mansi *kʷääl- has an unexpected labialized initial, Khanty *küüL- unexpected height and labialization of the vowel, instead of expected **kääl- and **kööL- (or **käL-).
  • *kätə ‘hand’: Mari has *kit instead of expected **ket.
  • *kiwə ‘stone’: Udmurt has /kɤ/ instead of expected /ki/ (which does occur in Komi).
  • *kulkə- ‘to go’: Hungarian has halad instead of expected ˣhol- or similar.

(Incidentally it is noteworthy that while there are some consonantal problems too, all of these cases show some vocalic problems.)

Regardless all of these etymologies show perfect soundlawfully regular reflexes in at least six other languages. At least the comparison of these is beyond any reasonable doubt. With these exception cases it’s however conceivable that some of them have in fact been adduced erroneously and should be treated as e.g. family-internal loanwords or as unrelated. [10]

There is also a smaller group still of completely and unambiguously clean widespread Uralic etymologies, including e.g. the above-considered *kala ‘fish’ and *pala ‘bit’. Should we perhaps prioritize these cases somehow when building up a phonological foundation? Maybe not. *kala and *pala both happen to lack known reflexes in Permic… If I were to propose some outrageously irregular reflexes from there, does this in any way weaken the other pairwise comparisons? The same really holds for more promising irregular reflexes too. As a reminder, the main point of the framework I am sketching in this post is to assess if a proposed reconstruction or system of correspondences is acceptable, or if it is better than some other proposal. That there remains more work to do is a different issue.

At other times still, a proposed etymology could have deeper fault lines, such as being more regularly considerable as two etymologies, possibly with a bridging member. These are also findable across Uralic, e.g. when western languages point to *kakta but eastern languages to *kettä as the proto-form of the numeral ‘2’. It is not clear to me what to do in such cases. They can still e.g. demonstrate branch-specific sound changes as long as we keep a leash on which pairs of languages are compared.

In any case, much like widespread sound correspondences, widespread etymologies are not all-or-nothing cases. They may be more regular between some languages, less regular between others. Single outliers or multiple equally distant ones will be easy to identify and possibly exclude at least. It is surely a problem if a proposed language family starts having substantial numbers of etymologies which only really work between a few languages and not any others, but this might well be a problem of etymological work and not of the relationship itself. It is hard to think of any formal justification for treating an irregular etymology as “too good to be rejected”. Substantial and intractable irregularity is a good reason to decide that a proposed cognate is just wishful thinking built on superficial similarity, or at the very least too weak to build a foundation on, no matter how long e.g. its pedigree in etymological literature is. The best illustrations for this principle surely come from cases where a different, more regular etymology turns out to be possible after all. The classic of the genre is the superficial resemblance of Latin deus and Greek θεός. From Uralic, consider e.g. Livonian sūoŗ ‘vein, sinew’: by current thinking this is not a reflex of PU *sënə ‘id.’ (> Proto-Finnic *sooni, reflected in all the rest of Finnic) with irregular *n > *r (> ŗ /rʲ/), it is instead a perfectly regular reflex of a distinct but partly synonymous PU root *särä. As an older example I could mention the comparison of Fi. aivot ‘brains’ with Northern Sami oaivvi ‘head’, which appears in some 19th-century works before being replaced by the current comparisons: Fi. aivot ~ NS vuoigŋašak ‘id.’, Fi. oiva ‘proper’ ~ NS oaivvi (both fully regular even if less immediately apparent). Similar examples could be collected at least by the dozens from etymological literature. [11]


There is one failure mode of overcriticality around this area though; it is one where difficulties in reconstruction are confused with irregularity. E.g. as I’ve pointed out before, in Khanty the development of *kala, *pala differs from a third rhyme word *sala- ‘to steal’. But *ɬaaL- as the reflex of the third is not irregular in any sense I’ve defined so far! There are several cases of *a-a > *aa, hence also several cases of correspondences like Finnic *a ~ Khanty *aa or Samoyedic *å ~ Khanty *aa. The only problem is in our lack of understanding of the conditioning factors that lead to a double representation *uu ~ *aa. It would be possible to e.g. propose an entirely regular reconstruction of PU with two open back vowels, *a and *å, distinguished only in Khanty.

Where from here

Whatever the exact route, it would be a long and at many points tedious exercise to work up from small phonological cores all the way up to our current understanding of Uralic etymology and comparative phonology. This would be regardless illustrative, I think. If we repeated the process with a few other language families too, we might be able to eventually set up an objective metric for how phonologically (ir)regular some known or proposed language relationship really is. Also, just the largest achievable phonological foundation is probably not a good metric. My suspicion is that allowing any and all minimally regular correspondences, without constraints for their number, will lead to a vast ballooning of the system of correspondences that can take just about anything we throw at it (parallel loanwords, parallel derivatives, onomatopoeia…), and something like WWAR will be a much better metric of regularity.

There will be further technical issues to work out too, such as the effects of subgrouping and intermediate reconstructions (which could be used to define something like “phonological subgroupiness” also); the methods we use for identifying conditional sound correspondences; or adding typological constraints for the segment inventory of the proto-language or the sound correspondences we will tolerate (a correspondence like *n ~ *n should surely require less evidence to be acceptable than a correspondence like *m ~ *k).

[1] In language families such as Uralic, which I call “trochaic” or “left-rooted” (I should probably expand on this concept later on as well), alignment is really largely trivial: initial consonants or zero initials always correspond, first-syllable vowels always correspond, medial consonants and respective components of clusters always correspond, stem vowels in languages with bisyllabic roots always correspond. Complications start to arise only in corner cases like metathesis, initial-vowel syncope, or derivational suffixes added to CVC stems. Diphthongs and long vowels could provide some problems too, but then contractions like *ej > ii can be always also rewritten as conditional correspondences along the lines e ~ ii and j ~ ∅.
[2] Then again, what we are currently accomplishing with AI in fields other than linguistics suggests to me that automated linguistic reconstruction cannot be done right on the first try in any case. Any reasonably feasible algorithm most likely has to be based on generating a first pass and then iterating improvements to it. If we are good enough at the latter, it’s OK if the former is still fairly bad. This is how real reconstruction also works, after all.
[3] In particular there are no phonological cores built out of comparisons covering only two languages but with the data altogether covering more than two languages: every two-language pair could be separated as its own core instead.
[4] What’s off is that the different treatment of the final vowels in Erzya is actually not due to any original difference in their strength, it is due to a recent and weirdly specific innovation syncopating *ə after *Cal. Unsyncopated forms have still been attested in Witsen’s 17th century records of Moksha.
[5] The absolute minimum is a comparison of two items with two segments that are the same in both, e.g. al, la in language 1 ~ er, re in language 2, or indeed, a comparison of two pairs of homophones.
[6] All single medial consonants are geminated in most of Sami in strong-grade positions (hence with sound correspondences distinguishable from initial consonants), all final vowels are lenited in languages like Mansi (hence distinguishable from initial-syllable vowels), all original consonant clusters are simplified or broken apart in Hungarian, etc.
[7] “Biconnected” can be taken to mean that between any two vertices, there are at least two mutually disjoint paths, or that the removal of any one vertex will not break the graph into two or more non-connected components. Upgrading these definitions to “three paths” or “two vertices” may not yield quite the same meaning for putative “triconnected” (clearly the former is stronger than the latter though).
[8] After all it has been proposed that in Finnish e-stem words, at least ones like this that have consonant-stem partitive singulars (kusta), only √kus- is really a part of the stem and -i : -e- is a prop vowel. Or even, -e- at least: another possibility, probably more provocative, would be to claim that -i is a nominative singular ending.
[9] Traditionally known proposals include pura ‘drill’ ~ fúr ‘to bore’, suippu ‘point’ ~ csúp ‘point’, survoa ‘to mash’ ~ szúr ‘to pierce’.
[10] E.g. relaxing semantics a bit, Mansi *jal could be compared also with *jalka ‘foot’ or #jülŋä ‘tree stump’ (though these only help with the *j-). For Hungarian íz (dialectally also éz), Finnish & Karelian eto- ‘to find disgusting’ seems like a promising direction of comparison.
[11] And perhaps they should be. Etymologies are most of the time cited in secondary and tertiary literature without the scaffolding of historical phonology that holds them up in the first place. I suspect this often leads to beginners and non-historical linguists getting the false impression (or maybe rather, strengthening the natural folk-etymological impulse of thinking) that just similarity is good enough for setting up an etymology.

Posted in Methodology

Native initial clusters in Udmurt

Typological definitions of Uralic [1] just about always note the lack of native word-initial consonant clusters. While the literary standards have their share of IE-derived clusters by now, in rural dialects and the Siberian languages clusterlessness is common enough to this day. However, exceptions can be found in the other direction too, although they seem to be an understudied topic.

The most obvious offender is Mordvinic, which sports all kinds of words like /kši/ ‘bread’, /kšńi/ ‘iron’. Perhaps in most cases these are IE loanwords in ultimate origin, but involving native syncope, in these examples from preforms along the lines of *kərsä, *kərtnä. Russian influence can still be suspected though, since apparently this syncope is mostly post-Proto-Mordvinic. Two illustrative examples: Moksha /kštralks/ ‘bobbin’ ← /kšťir/ ‘spindle’ + /alks/ ‘bottom’, cognate to Erzya bisyllabic /ščeŕalks/; Erzya /troks/ ‘across’, cognate to Moksha /tərks/, /turks/ (PMo *turəks?). But I do not think the details of the development of these have been worked out in full, and several cases built on native Uralic material can be found also, such as Er. /pŕa/ ~ Mk. /pŕä/ ‘end, head’ (< PU *perä), /pškaďems/ ‘to blow’ (~ Fi. puhkua, Komi /pušky-/ < *puš-kV-). I could also submit some new wilder etymological hypotheses: e.g. could /pra-/ ‘to fall’ be from *pda- < *pədá- < PU *puďa- ‘id.’ ??

(Edit 2020-09-14: cf. now a hypothesis on reconstructing a first-syllable *ə still in PMo.)


The precise history of the Mordvinic initial clusters would really be a fairly large research project. Before diving into it, a decent typological parallel and a much more tractable case study of natively arising clusters in Uralic seems to be provided by Udmurt. In the literary standard, consonant clusters outside of Russian loanwords are rare but still extant. They seem to have a slightly extended presence in the dialects also. I’ve almost never seen this fact explicitly pointed out, however; it came to my full attention accidentally only this March, while reading Michael Geisler’s Vokal-Null-Alternation, Synkope und Akzent in den permischen Sprachen (2005, Veröffentlichungen der Societas Uralo-Altaica 68), which primarily treats V2 syncope.

Interestingly there is a fairly simple phonological rule behind the rise of initial clusters in Udmurt: /ɨ/ is lost in the position CɨCV₂, where V₂ is a full = non-/ɨ/ vowel (though I have no examples with /u/), if the result is a “legitimate” consonant cluster. Nearly all examples I’ve found adhere to this (see below for one clear + one possible exception), and I’ve also found no widely distributed counterexamples in underived roots. In derivatives from or inflected forms of CɨC or CɨCɨ roots, syncope could be expected to be mostly reverted / prevented by analogy of course.

There is more uncertainty in the details of what counts as a “legitimate” consonant cluster, as well as in how widely this rule is reflected in the Udmurt varieties (it is almost surely post-Proto-Udmurt). The data below is mainly from the intersection of Wichmann’s Wotjakischer Wortschatz and Csúcs’ Die Rekonstruktion der permischen Grundsprache, the latter taken into account to ensure I am indeed dealing with inherited Permic material and not recent loanwords / coinages entirely. Dialect abbreviations are G(lažov) (northern), S(arapul) and M(almyž) (central), J(elabuga) (southern), Uržym (MU) and U(fa) (southeastern).

The best-established cluster type is stop + /r/:

  • /dɨr/ ‘probably’ → MU /dɨrak/ ~ /drak/ ‘id.’
  • /kɨrɨ-/ ‘to dig’ → G /krem/ ‘dike’
  • /kɨre(d)ź/ (most dialects) ~ literary & U /kreź/ ‘traditional box zither instrument’
    (similar to the Russian gusli, Finnish kantele, etc.)
  • /pɨr/ ‘always’ → /prak/ (several dialects) ‘id.; straight’
  • /pɨrɨ/ ‘piece’ ~ U /pri/ ‘id.’
  • /tɨr/ ‘full’ → /tɨros/ ~ /tros/ (several dialects) ‘id.; many’

In the Wichmann+Csúcs data, there is also one example each of /pl-/ and /sl-/:

  • /pɨlaśkɨ-/ ‘to bathe’ ~ G MU /plaśkɨ-/
  • /sɨlal/ ‘salt’ ~ G /slal/

So no big surprises so far, just falling-sonority clusters of a globally common type.
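
The rule as stated so far can be written out as a small Python sketch. This is only an illustration under stated assumptions: the function names are hypothetical, and the inventory of “legitimate” clusters is simply the stop + /r/, /pl-/ and /sl-/ types listed above, not a worked-out account of Udmurt phonotactics.

```python
# Sketch only: the cluster inventory below is an assumption based on the
# examples listed in this post, not an established description of Udmurt.
VOWELS = set("aeiouɨɤʉ")
STOPS = set("pbtdkg")
OTHER_ONSETS = {"pl", "sl"}  # the two non-/r/ clusters attested above


def is_legit_cluster(c1, c2):
    """Assumed test for a 'legitimate' onset: stop + r, or pl-, sl-."""
    return (c1 in STOPS and c2 == "r") or (c1 + c2) in OTHER_ONSETS


def syncopate(word):
    """Drop the first-syllable ɨ of CɨCV2 if V2 is a full (non-ɨ) vowel."""
    if len(word) < 4:
        return word
    c1, v1, c2, v2 = word[:4]
    if (
        c1 not in VOWELS
        and v1 == "ɨ"
        and c2 not in VOWELS
        and v2 in VOWELS
        and v2 != "ɨ"
        and is_legit_cluster(c1, c2)
    ):
        return c1 + word[2:]
    return word


# Forms taken from the lists above; pɨrɨ has ɨ as V2 and is left alone.
for form in ["dɨrak", "tɨros", "pɨlaśkɨ", "sɨlal", "pɨrɨ"]:
    print(form, "→", syncopate(form))
```

The sketch deliberately does nothing with CɨCɨ roots, so a form like U /pri/ is not predicted by it; analogy in derivatives and inflected forms is likewise outside its scope.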

A very different case can be found in ‘rye’: /dźeg/ in literary Udmurt and almost all dialects, but with a bisyllabic byform /dźɨźeg/ ~ /dźiźeg/ in Uržym, which is clearly more original in light of the Komi cognate /rudźɤg/. [2] Also interesting is the Ufa form /źeg/, since in this variety word-initial *dź- normally gives a nonsibilant affricate [ɟʝ-] (= ďj in Wichmann’s transcription). I would hypothesize that this is not a case of *dź+ź losing the second member, but rather *dź+dź losing the first member, already before the lenition *-dź- > /-ź-/ that is found in most dialects of Udmurt; then this new *dź deaffricates in Ufa even initially, which nicely parallels it having also *dž- > /ž-/. [3] Of course most of this might be also simply some sort of haplology, rather than ever going through an actual cluster *dź(d)ź- at all.

Another haplologyish case is ‘eight’. Per /kɨk/ ‘two’, and Komi /kɤkjamɨs/, the pre-syncope Proto-Udmurt form of this must be *kɨkjamɨs (as reconstructed also by Wichmann). Only syncopated forms have been recorded though: /ťamɨs/ in most dialects, a byform with /ťj-/ in Glažov as the only hint that something is up. [4]

Another group still is built up by clusters of the type sibilant + stop/nasal. These demonstrate some “regression to the mean” — they tend to “de-cluster” again across the Udmurt dialects, but this time by epenthesis of an initial vowel: /i/, partly also /ɨ/. This of course leaves the cluster as such intact, but does break it into two different syllables. The Glažov variety appears to fairly consistently retain the elsewhere syncopated original vowel however, though possibly colored to /i/ by palatals. As for where an epenthesized form occurs or not, I see no pattern. Double representation is common, and probably both variants exist widely side by side, and the literary standard and in some cases Wichmann have randomly ended up sampling only one or the other.

  • G /sɨkal/ ~ literary, G J MU U /skal/ ~ S M /iskal/ ~ J MU U /ɨskal/ ‘cow’
  • G /sɨpaj/ ~ U /spaj/ ~ MU /ispaj/ ‘beautiful, good’
  • G /šɨnɨr/, /-ń-/ ~ literary, S M J /inšɨr/, MU /ińšɨr/, U /imšɨr/ ‘threshing ground’
    ~ Komi /rɨnɨš/ < *rɨŋɨš < *riŋəšə > Finnic *riihi
  • G /śińer/ ~ M /er/, MU /śńer/, U /šńer/ ~ literary, S J /iśńer/, M /iśńɤr/ ~ U /ɨšńer/ ‘broom’
    ~ Komi /jiś/; compound with /ńɤr/ ‘twig, rod’
  • G /śike/ ~ MU /śke/, /ske/ ~ literary, M J /iśke/ ‘so, thus’
    ~ Komi /eśkɤ/ ‘conditional mood particle’

Syncope-then-epenthesis would not be the only possible history for these, but this has support from the fact that both /ɨ/-syncope and pre-sibilant epenthesis can be independently attested, the latter in Russian loanwords such as U /smolla-/ ~ M /ismola-/ ‘to tar’  ← смола ‘tar’, G /šľapa/ ~ MU /iślapa/ ← шляпа ‘hat’, G /štop/ ~ J /ɨštop/ ‘jug’ ← штоф ‘a measure’, J /iźver/ ‘predator’ ← зверь ‘beast’. Note also the lack of epenthesis to **šiľapa, **šɨtop in G.
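
The syncope-then-epenthesis scenario can be sketched as two independent steps in Python. The function names and segment classes are assumptions made for illustration; in particular the second cluster member is restricted to stops and nasals, as in the native examples, so loanword cases like /iślapa/ or /iźver/ are not covered.

```python
# Sketch only: segment classes are assumptions drawn from the examples above.
SIBILANTS = {"s", "š", "ś", "z", "ž", "ź"}
STOPS_NASALS = {"p", "b", "t", "d", "k", "g", "m", "n", "ń"}
VOWELS = set("aeiouɨɤ")


def syncope(word):
    """Delete ɨ in sibilant + ɨ + stop/nasal + full (non-ɨ) vowel."""
    if (
        len(word) >= 4
        and word[0] in SIBILANTS
        and word[1] == "ɨ"
        and word[2] in STOPS_NASALS
        and word[3] in VOWELS
        and word[3] != "ɨ"
    ):
        return word[0] + word[2:]
    return word


def prothesis(word, vowel="i"):
    """Prefix an epenthetic vowel to an initial sibilant + stop/nasal cluster."""
    if len(word) >= 2 and word[0] in SIBILANTS and word[1] in STOPS_NASALS:
        return vowel + word
    return word


# G sɨkal as the assumed starting point for 'cow'
step1 = syncope("sɨkal")         # syncopated variant
step2 = prothesis(step1)         # variant with epenthetic i
step3 = prothesis(step1, "ɨ")    # variant with epenthetic ɨ
print(step1, step2, step3)
```

Keeping the two steps as separate functions mirrors the point that each is independently attested: `prothesis` alone handles a loanword like /štop/, which never had a vowel to syncopate.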

I assume ‘threshing ground’ has been further metathesized from expected *išnɨr by folk-etymological influence of /in/ (~ J MU /iń/) ‘place’. Why the Ufa form has /m/ I have no idea; that the word had proto-Permic and thus most likely also Proto-Udmurt *-ŋ- does not clarify anything. In ‘broom’ Komi seems to suggest original *(j)iś-, but as this has no further etymology, maybe this is rather a loan from Udmurt with the 2nd part dropped. Also, perhaps the first part of the compound is /śi/ ‘hair, bristle, fibre’ (also occurring with /ɨ/ in MU /ďɨrśɨ/ ‘head hair’)? A broom is indeed a ‘rod with bristles’.

In ‘so, thus’ the initial vowel is clearly original however, as this comes from Volga Bulghar *ećke > *ićke (> Chuvash /əśke/), and hence requires etymological nativization in G.

The case of ‘cow’ then seems to have relevance beyond Permic even. This has known cognates a bit more widely, but these are also syncopated and partly even epenthesized! /skal/ in Mordvinic, /škal/ ~ /əškal/, /ŭ-/, /u-/ in Mari. UEW treats these as coming from a common protoform *uskalɜ with somewhat arbitrary loss of the initial vowel. However, if I am correct about the Glažov forms in /SIC-/ being mainly archaisms, then this is probably not correct: at least the Mari forms should be considered one or more loans from Udmurt specifically. The Mordvinic form could still have come about by parallel syncope. As commented by Bereczki (1992), retained /a/ in the original 2nd syllable most likely regardless indicates an areal loanword of some unknown origin. But now we seem to know that the shape of this source has probably been more like #sukal or #sikal than #skal or #uskal.

A sixth possible member in this group could be /ɨštɨr/ ~ M /ištɨr/ ‘footrag’ from a *štɨr < *šɨtɨr, as I suspect on grounds of the unmotivated /i/ in the Malmyž variant. This too has a Mari equivalent /štər/ ~ /əštər/ that would again have to be a loan from Udmurt. Furthermore these have been compared by Wichmann [5] even with Finnish (+ Karelian, Ludian, Veps) hattara ‘footrag’ < ? *šattara. The vowel correspondence a ~ /ɨ/ is rare and irregular though, so probably this is in any case not all the way from Proto-Uralic. The Finnic lexeme has also an alternate etymology as a semantic specialization of the homonymous hattara ‘fluff’.


‘Threshing ground’ and my proposal for ‘footrag’ diverge from the other examples by showing syncope from *CɨCɨC. Even these could be seen as kind of regular, once we consider the mechanism more carefully. The basic conditioning mechanism is surely not vowel quality per se, but rather stress. A typical feature across the more central Uralic languages (and also Turkic!) is a pattern where stress is still technically initial by default, but is widely retracted onto “stronger” vowels (long, full, open) later on in the word. In other words: syncope targets specifically pretonic /ɨ/. This would suggest that the immediate predecessor of hypothetical *šnɨr and *štɨr was more specifically iambic *šɨˈŋɨr, [6] *šɨˈtɨr, in contrast to trochaic stress on the more typical *CɨCɨ roots. If so though, it is too early for me to take a guess on what would have been the reason for such unexpected stress placement.
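
The stress-based reading of the rule can also be sketched in Python. The stress assignment here is a deliberate simplification and an assumption of the sketch: stress falls on the first full (non-/ɨ/) vowel, otherwise on the initial vowel. Cluster legitimacy is ignored, and the function names are hypothetical.

```python
# Sketch only: "first full vowel, else initial" is a simplified stand-in for
# the stress pattern described above, not a tested description of Udmurt.
VOWELS = set("aeiouɨɤ")


def vowel_positions(word):
    """Indices of all vowel segments in the word."""
    return [i for i, ch in enumerate(word) if ch in VOWELS]


def default_stress(word):
    """Index of the stressed vowel under the simplified assumption."""
    positions = vowel_positions(word)
    for i in positions:
        if word[i] != "ɨ":
            return i
    return positions[0] if positions else None


def drop_pretonic(word, stress=None):
    """Delete a first-syllable ɨ if stress falls on the second vowel."""
    positions = vowel_positions(word)
    if stress is None:
        stress = default_stress(word)
    if len(positions) >= 2 and word[positions[0]] == "ɨ" and stress == positions[1]:
        first = positions[0]
        return word[:first] + word[first + 1:]
    return word


print(drop_pretonic("tɨros"))            # default stress on o
print(drop_pretonic("šɨtɨr"))            # default (trochaic) stress on first ɨ
print(drop_pretonic("šɨtɨr", stress=3))  # stipulated iambic stress
```

Under default stress the *CɨCɨC shape is left intact, so the sketch only yields *štɨr when iambic stress is stipulated by hand, which is exactly the unexplained part of the proposal.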


If there has been fairly regular loss of pretonic /ɨ/ in Udmurt, a natural follow-up question is: what about examples where this doesn’t lead to initial consonant clusters?

Two subtypes can be considered. The first would be aphaeresis: we would expect words of the shape *ɨˈCV(C) (where V ≠ ɨ) to again lose the first syllable and to give plain monosyllabic /CV(C)/. There perhaps are indeed some of these out there, since per Csúcs’ Proto-Permic vocabulary, it seems that no examples of this root shape have survived intact in Udmurt. The only examples of basic word roots with surviving word-initial /ɨ/ are all either monosyllables (/ɨľ/ ‘moist’, /ɨń/ ‘flame’, /ɨm/ ‘mouth’…), have /ɨ/ also in the 2nd syllable (/ɨbɨ-/ ‘to shoot, throw’, /ɨšɨ-/ ‘to be lost’…), or have an intervening consonant cluster that would not work as a legitimate word-initial cluster (/ɨrgon/ ‘copper’). /ɨ/ remains also in the compound /ɨbes/ ‘gate’ (from /ɨb/ ‘field’ [7] + /ɤs/ ‘door’) and a few inflected forms like /ɨč-e/ ‘such’. However, I also do not find any clear enough candidates where a Komi word of the shape /ɨCV/, /uCV/, /ɤCV/, /iĆV/ would lack an Udmurt cognate. The closest are Old Komi /idɤg/ ‘angel’ (whose hypothetical Udmurt cognate would be expected to be *ideg and not **ɨdeg > **deg), and Komi /ɨrɤš/ ‘ale’, a derivative from a lost verb *ɨr- (so perhaps derived only within Komi). The root shape *ɨCV(C) indeed seems to be lacking in Proto-Permic entirely. Would it be too bold to hypothesize that these have actually lost their initial vowel in both Komi and Udmurt?

One speculative etymology of this type could be ‘udder’. The more southwestern Uralic groups all use some sort of a loanword from Indo-European (F. *udar, Mo. *odar, Mari *wåðar). The Permic languages however have an unetymologized /vera/ ~ /vɤra/. If this came from earlier *ɨvɤra, perhaps it could be a part of the same group after all? But the final /a/ looks worrisome (‘udder’ seems to have been a consonant stem all the way from PIE to attested Indo-Iranian reflexes), as does wringing /v/ out of *-ð- < *-d- < *-t-, which normally lenites all the way to zero in Permic.

— The second possibility of difficult-to-detect *ɨ-loss is syncope before a zero medial. Udmurt has occasional bisyllabic vowel clusters of a relatively wide variety, e.g. /ju.a-/ ‘to ask’, /ju.ɨ-/ ~ /jʉ.ɨ-/ ‘to drink’, /ju.o/ ‘I will drink’, /ki.on/ (~ /kijon/) ‘wolf’, /lu.o/ ‘sand’, /na.a-/ ‘to look at’, /śi.ɨ-/ ‘to eat’, /vu.ɨ-/ ~ /vʉ.ɨ-/ ‘to come, come to completion’, /vu.em/ (~ /vujem/) ‘row’ (< *’order’ < *’completion’). There however again seem to be no examples of the shape Cɨ.V — or at least: none surviving as such.

I can also propose at least one actual candidate for this type of syncope, with a bit more confidence than the previous example even. In nouns we would expect this to yield a simple monosyllabic /CV(C)/ root. There are, however, no fully monosyllabic verbs in Udmurt! All have a stem vowel, in the citation form = the infinitive (ending in /-nɨ/) either /ɨ/ or /a/. This marks what at least some (maybe most?) grammars call the inflection class of the verb. [8] So what would happen if a verb of the shape *Cɨ.a- were to be syncopated to *Ca-? It seems to me that a likely outcome would be to pleonastically apply a second instance of class-marking /a/. This gives, I think, at least a good hypothesis for what’s up with the rather strange-looking /na.a-/ ‘to look at’ (attested only in the Besermyan dialect). This has usually been considered a reflex of PU *näkə- ‘to see’, but we would expect the reflex of this root in Permic to be rather *ni- (cf. /ki/ < *kätə ‘hand’) or perhaps *nɨ- (cf. /tɨ/ < *täwə ‘lung’). And my thinking is that this may have been indeed the case still in Proto-Permic, if this Udmurt form comes from earlier *na- < *nɨ.a-.


The attrition of initial consonant clusters from a language’s phonology can be observed in dozens of languages (Sinitic, Tibetic, Indic, Iranian, Armenian, Albanian…). Their introduction natively seems much rarer though; yet this process should be equally important for understanding large-scale shifts in typology. Other examples I know of are however mostly limited to a few way-out-there cases that clearly must have deleted vowels quite rampantly, e.g. Itelmen (Western /kɬfənʲck/ ‘in front’), the Okinawan languages (Amami /ʔkwa/ ‘child’, Ōgami /pstu/ ‘person’), or all the “sesquisyllabic” languages of SE Asia. Udmurt is in contrast a quite pleasantly tractable case, where only modest clusters have arisen in minor amounts. Yet, as the *ST- *SN- section shows, even these can throw up further complications. Some further cross-linguistic comparison with other cases would be interesting… if they first can be found somewhere. I suppose I already have Mordvinic lined up next. Another case I’ve seen reported is Central Dravidian, where the main rule seems to be a Slavic-esque liquid metathesis. But that’s about it for leads I have within the major Eurasian language families that I have the most knowledge of. Probably I would have to look into e.g. minor Niger-Congo or Austronesian languages (subfamilies even?) to find further cases where it is relatively sure that a language has definitely evolved from allowing only simple onsets to allowing initial consonant clusters.

[1] Not that there are any exclusive and pan-Uralic typological features; any kind of a “Uralic typological profile” immediately bleeds further east towards “Ural-Altaic” and/or “Uralo-Siberian”.
[2] This is surely in turn in some fashion from Germanic–Balto-Slavic (and → Finnic) *ruǵʰis, seemingly with either metathesis (PP *rudźeg < *rugedź, somehow via Finnic?) or a new velar suffix (PP *rudź-eg ← *rudź via BSl.?; loss of *-s is expected, cf. *pårś ‘pig’ from IE *porḱos).
[3] The Ufa variety clearly must’ve already had its /ďj-/ at this point too. I even wonder if this could be a hint that this is a retention, that Proto-Udmurt *dź- (< PP *dź- and also *r- / _VĆ, _V{s z}) was actually rather nonsibilant *ďj- or even a stop *ď-. Which could then have interesting further implications too, e.g. should this be perhaps applied even to Proto-Permic? Already for Proto-Udmurt and Proto-Komi, very few instances of *ď can be reconstructed, even fewer still for Proto-Permic, and only word-medially it seems.
[4] What’s also curious is that most varieties do not have a general shift *kj > /ť/, and palatalization seems to have taken place in this case only due to the cluster *kj having been forced to occur within one syllable. In some cases apparent palatalization can be found also medially, e.g. ‘to laugh at’: standard and most varieties /śerekja-/, but Uržym /śereťa-/ ~ /-kť-/ ~ /-ḱ-/. But then this variety also has general *j- > /ď-/ word-initially, and that’s probably also where the form with /-ť-/ comes from in the first place, not from *kj > ḱ > ť. (The /-kť-/ variant also suggests the same.) This pathway looks even clearer when compared with Ufa /śeregďa-/, where presumable intermediate *-kď- has assimilated in voicing regressively, not progressively.
[5] Wichmann, Yrjö. “Etymologisches aus den permischen Sprachen”. — Finnisch-Ugrische Forschungen 12: 128–138.
[6] While *ŋ > *n is regular in a few Udmurt dialects, including in Glažov when adjacent to /ɨ/ (*šɨŋɨr > /šɨnɨr/), it is not widespread enough to seem to me like the main explanation for why no /ŋ/ remains anywhere else either. What strikes me as more likely is that syncopated **šŋɨr was immediately adjusted to *šnɨr due to Udmurt not tolerating word-initial /ŋ/.
[7] A cranberry morpheme in attested Udmurt, but an independent lexeme in Komi, and per also a second derivative /ubo/ ~ /ɨbo/ ‘beet’ in Udmurt, this probably still existed in Proto-Udmurt too, perhaps up to the time of the syncope rule.
[8] Unlike e.g. Hungarian, or typical older Indo-European languages, this contrast does not affect the choice of endings as such, only the stem morphotax. At a pinch, consonant-initial suffixes are added to a vowel stem, either /CVCɨ-/ or /CVCa-/; vowel-initial suffixes to a consonant stem, either /CVC-/ or /CVCal-/. It would be possible to treat the contrast also as one between underlying consonant stems and underlying vowel stems (/CVC-/ versus /CVCa-/), with /ɨ/ and /l/ inserted as prop vowel and prop consonant when required, though these are not anything like general morphophonological rules in Udmurt (the default prop consonant is /j/). — /ɨ/ is also syncopated in some positions in some dialects, roughly according to what medial consonant clusters Udmurt tolerates in general. This creates a more “natural” look for verb inflection (and in these dialects we definitely should speak of consonant-stem verbs). As a tangent: contrary to what most reconstructions claim, I however do think that this is indeed syncope, and more consistent vowel-stem inflection is the Proto-Udmurt and probably also Proto-Permic state of affairs. If dialect forms like /karnɨ/ ‘to do’, /punnɨ/ ‘to plait’ (even if nicely paralleled by Komi /karnɨ/, /pɨnnɨ/) were to be soundlawful predecessors of more widespread forms like /karɨnɨ/, /punɨnɨ/… why do words like /kɨrnɨ(d)ž/ ‘raven’, /tunne/ ‘today’ then not also turn into ˣ/kɨrɨnɨ(d)ž/, ˣ/tunɨne/? At most vowel insertion could be analogical, and this then fails to explain why the distribution of /CVC-/ versus /CVCɨ-/ in the “consonant-stem dialects” is quite consistently phonologically conditioned.

Posted in Reconstruction

Etymology squib: *äńćä ‘(rasp)berry’

A repeating complaint I run into with the more impressionistic reconstructions found in the UEW is the frequent use of *ŋ as a kind of a deus ex machina phoneme, reconstructed for all sorts of confusing correspondences of nasal consonants. One offender is the word for ‘raspberry’, given as *äŋɜ-ćɜ. No reflexes are known from Samic, Finnic or Samoyedic, which often spells trouble for working out the overall root shape; and a lack of reflexes in Finnic or Hungarian in turn spells trouble for coverage in the etymological literature. All the remaining Uralic branches still have reflexes though, ranging from Moksha to Southern Khanty, so there is probably still something native Uralic in here, not just late areal loanwords (in Mansi this is regardless a clear Komi loanword).

A casual look over the reflexes indeed reveals a clear /ŋ/ in Mari (Meadow /eŋəž/, Hill /əŋgəž/); and the Khanty reflex *-ääńć only appears as a latter member in compounds, and hence could be expected to have gone thru a bit more reduction than usual anyway. But this is about as far as we get before things stop working.

In Permic we find /m/: Udmurt /emedź/ ~ /emeź/, Komi /ɤmɨdź/ ~ /ɤmidź/ ~ /ɤmedź/ (probably < PP *ɛmedź). This is not an entirely unprecedented reflex of *ŋ, but usually development to *m only takes place adjacent to labial vowels, and even then should be usually retained in various Udmurt dialects. Compare e.g. PP *pɔŋ ‘head’, whence northern and central Udmurt /pum/, southern Ud. /puŋ/, most Komi /pon/, northernmost Komi /pom/.

Within Mordvinic, Erzya /ińźej/ ~ /ińźeŋ/ (also /ińdźej/ in Paasonen’s dialect data) could seem to suggest syllable contraction and POA assimilation similar to the implicit Khanty development: *-ŋVć- > *-ŋVź- > *-ŋź- > /-ńź-/? However, this fails in light of Moksha /ińəźi/, /ińiźi/, which has evidently escaped syncope, showing that the word was still trisyllabic *iNəźəŋ in Proto-Mordvinic. But medial *-ŋ- should then have definitely yielded **-j-! A development *-ŋ- > *-ń- that is probably being supposed here is otherwise entirely unknown in Mordvinic. The same would go for any kind of a suggestion of secondary epenthesis *-ńź- > /-ńiź-/.

Even in Mari there is the further problem that Hill Mari /ə/ does not regularly reflect PU *ä. This, however, could prove to be the key to the problem. We do find at least one other parallel for the correspondence MMa /e/ ~ HMa /ə/ word-initially, which I think is not accidental: the verb /eŋa-/ ~ /əŋgä-/ ‘to burn’ (from PU *äŋə-, demonstrating also that in Komi the expected reflex of *ŋ in a front-vocalic environment is /ń/). Raspberries happen to have a natural connection to fire: in the taiga zone, they are a typical pioneer species thriving in forest areas cleared by wildfire, sometimes in quite good abundance until crowded out by a shading tree cover. So I will suggest that the Mari name of the raspberry is in fact directly based on the verb stem ‘to burn’, and the “suffix” /-əž/ is what actually brings in the meaning ‘berry’. In light of Khanty, I will further suggest this is indeed an old compound, with the second member continuing a simpler root *äńćä ‘(rasp)berry’.

The other reflexes can be probably treated as compounds as well. In Mordvinic the first member could be perhaps identified with the first syllable of *ińďəŕ ~ *ińďəŋ ‘honeysuckle’ (which grows red compound berries similar to the raspberry) or *ińə ‘big’ (the raspberry grows fairly large berries even in the wild, unlike other culinarily important species such as the strawberry or blueberry). For Permic *ɛm- I do not have any etymology; however, a compound analysis seems to gain some other support from the fact that there is no real evidence for a “suffix” *-edź in Proto-Permic. And then among the very few words showing this ending, another berry can be found too: *pɛledź ‘rowanberry’. Usually this has been taken as a somehow heavily divergent reflex of PU *pićla ~ *pićrä ‘rowan’, and there could be some of this source still involved after all. But PP *pɛledź would also come quite close to a compound of *pȯl ‘time, instance’ + my already hypothesized *-edź ‘berry’. Hence: ‘many instances of berry = clustered berry’, as at least a folk-etymology? *ɛ ~ *ȯ are not quite identical, definitely not if read as IPA [ɛ] and [ɵ]; etymologically they are however both primarily reflexes of PU *ä, coincide in having /ɤ/ as their main reflex in Komi, and Zhivlov (2010) has even already shown that the two are in a complementary distribution in a fairly large number of environments; though the topic will probably still need some work.


A few other berry words have been reconstructed for PU as well: the clearest is *mura ‘cloudberry’, and also *pola or *pala ‘? lingonberry’ is quite probable. Finding any really new etymologies in this semantic area would likely however require first looking at each language group’s berry terminology as a whole… One precedent for this exists, a case study of Finnic berry names by Eino Koponen: “Itämerensuomalaisen marjannimistön kehityksen päälinjoja ja kantasuomen historiallista dialektologiaa”, 1991 (SUSA 83: 123–161); and what this reveals is absolutely rampant analogy, e.g. that Finnish puolukka ‘lingonberry’ has likely been rebuilt almost entirely after/in tandem with juolukka ‘bog bilberry, Vaccinium uliginosum’, or Fi. vadelma ‘raspberry’ evolving in some dialects, after some other steps, into vaarain under the influence of muurain ‘cloudberry’. This possibility will probably need to be taken into account elsewhere too.

Posted in Etymology
