In recent years, Tamás Janurik has been releasing online numerous papers, small surveys and reference materials on the Uralic languages, particularly Samoyedic and Hungarian (all mainly thru his academia.edu page). Last week the roster was joined by what seem like two particularly notable works: Kamassz szótár and Kojbál szótár, two “doculectal-comparative” dictionaries that aim to arrange together and morphologically analyze all currently available lexical material on these extinct Samoyedic languages. Despite titles and introductions in Hungarian, the bulk of both dictionaries actually use German as their main metalanguage. Conveniently (if not for anglomonoglots), basic glosses are also provided in no fewer than three languages: German, Hungarian and Russian.

The haul is respectable: 1456+114 word groups for Kamassian and 570 for Koibal (with Russian loanwords in Kamassian listed separately from the “native Siberian” word stock [1]). A comparison that easily springs to mind is with the etymological lexicon of Helimski’s Die matorische Sprache, documenting 1134 word groups across all varieties of Mator, and at least the Koibal dictionary might reach similar status as a standard lexical source. For Kamassian there still remain unpublished archive materials though, some already from the main field researchers Castrén, Donner and Künnap. Given their close relationship, in principle it might be also a good idea to eventually arrange all Kamass–Koibal material in a single etymological database or the like.

So far I’ve been poring over the Koibal data and its etymological remarks. Going back to the original sources of Spasskiy and Pallas (and also cataloguing their later appearances especially in the works of Klaproth), Janurik turns out to identify a good couple dozen more Koibal cognates for Kamassian and other Samoyedic languages than are listed in earlier reference works. No more than four of these lack Kamassian equivalents altogether, though: from Spasskiy корламъ ‘to ask’ (PS *kå-), пысва ‘rotten’ (PS *poså- ‘to rot’), тугуламъ ‘to gnaw’ (PS *t¹okɜ-); from Pallas chailàn ‘gull’ (PS ? *kələjə). This could, though, be in part because Janurik does not seem to propose any entirely new Proto-Samoyedic roots, and limits himself to adducing new Kamass and Koibal reflexes for previously known ones. This still leaves a good amount of unetymologized vocabulary awaiting further research. All of it is now at least well identified and collected together.

Janurik employs an admirably detailed scheme of marking each word group with an etymological code: P1–P5 for words that seem native to some extent within Samoyedic, L1–L3 for post-Proto-Samoyedic loanwords, XX for entirely isolated words. The distinction between his layers P1 (Proto-Uralic) and P2 (Proto-Samoyedic) is not quite up to speed on 21st-century research, but this is a minor detail here. Similarly I wonder about at least the naming of his group P3 (Proto-South Samoyedic), when it is Janurik himself who has presented one of the clearest arguments against assuming such a subgroup. [2] But it is certainly of some value to distinguish Kamass–Koibal words with and without northern Samoyedic cognates, as the latter e.g. might be more likely to turn out to be areal loanwords rather than actual common inheritance.

The newly identified cognates so far already provide food for thought anyway. For a simple example, the aforementioned chailàn ‘gull’ seems to be slightly off compared to the earlier PS reconstruction, suggesting rather something like *kəjələ. A slightly better match in root structure could be actually UEW’s *kaja(-ka) ‘gull’; or, since PU *a > PS *å > Kamass–Koibal a is a minority development (normally *å > o, u) and incompatible with the potential Nenets and Selkup cognates that certainly require *ə, maybe the best solution would be independent formation after all from a mimetic root √kaj-.

A second bird name that leaves me thinking is Km. šēgə ~ Kb. сега ‘cuckoo’. This could be derived entirely regularly, together with cognates in Selkup, from PS *käkV. Clearly this is another old mimetic term, at least predating the assibilation of PS *k to *š; but how old exactly? Several comparable words for ‘cuckoo’ turn up again also further west in Uralic, including Khanty *käɣii, Udmurt /kikɨ/, Komi /kɤk/ and Finnic *käki (the first three reported but considered improbable in SSA). The medial consonants and vowel correspondences do not entirely behave though. At best Khanty and Finnic would point to *käkə, Samoyedic and Permic to *käkkä; or maybe Samoyedic and Khanty to *käkä. This all might not be fatal in a bird name; some of this could be reshaping to retain a more iconic shape for the word (whereas e.g. from *käkə we would otherwise expect *kä in Samoyedic). But then we could ask as well if this is not due to the words being independently formed; or borrowed even: the Finnic words have been often considered to be loaned from Baltic (cf. Lithuanian gegužė with a dialect variant gegė), though this remains uncertain too for similar reasons. Really, the entire distinction between “reshaping” and “independent formation” seems somewhat vacuous when dealing with words of this sort that have had an iconic motivation available all along. Quite likely Proto-Uralic did have a name for the cuckoo that was something like #kVkV, but whether this has actually survived in an expected regular shape anywhere would have to be guesswork. [3]

Next up, the case I find the most interesting is that of the Kamassian and Koibal words for ‘son-in-law’. I’ve already noticed earlier that the former would go well with a hypothesis I have on the reconstruction of this word in Proto-Uralic, and Janurik’s newly adduced Koibal cognate seems to support the idea further. Actually even the Kamassian cognate has not appeared in etymological references earlier as far as I can tell. This is not a major surprise, since the form is malmi, quite far from either SW’s Proto-Samoyedic reconstruction *wiŋə or UEW’s Proto-Uralic reconstruction *wäŋe.

The first key to this puzzle is provided by Kamassian alma ‘dream’. Nominally this comes very close to Ugric forms for the same (e.g. Hungarian álom : álmo-), and UEW goes as far as to support a wild proposal of a loanword from Khanty. Janhunen in SW however suggests a different solution. Within Samoyedic a clearly different root can be reconstructed for ‘dream’: *äŋwå, and the Kamassian word could be derived from this via assimilation–then–dissimilation, *ŋw > *ŋm > lm. Such a sound change series would already provide more grounds for comparing malmi with PS *wiŋɜ (note also that *w- > *b- > m- before a word-internal nasal is a known regular sound law). The Koibal cognate identified by Janurik comes in at exactly this point: we find here the form манмемъ (most likely a 1PS possessed form ‘my son-in-law’), suggesting that also this instance of Km. /lm/ has indeed evolved from *ŋm. I am not certain, however, whether this should be taken as still containing /ŋm/ (thus Janurik) or, as it can be read prima facie, /nm/. This latter could be still archaic with respect to Kamassian of course, i.e. in more detail we would have *ŋm > /nm/ > /lm/. (The other possible routing I guess is *ŋm > *ɫm > /lm/, slightly more awkward since there seems to be no reason to assume a distinct velarized *ɫ at any point in the history of Kamassian.)
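The chain of changes discussed here can be traced mechanically. What follows is only a sketch under loud assumptions: it operates on bare consonant skeletons, with `V` standing in for any vowel (vowel developments are left out entirely); the relative ordering of the rules is chosen for illustration and is not a claim about chronology; the intermediate *nm* stage is just the option suggested by the Koibal spelling; and the names `RULES` and `derive` are made up for this example.

```python
import re

NASAL = "mnŋ"

# Ordered rewrite rules: (label, regex pattern, replacement).
# ASSUMPTION: this ordering is illustrative, not an established chronology.
RULES = [
    ("*ŋw > *ŋm (assimilation)", r"ŋw", "ŋm"),
    ("*w- > *b- before a word-internal nasal", rf"^w(?=.*[{NASAL}])", "b"),
    ("*b- > m- before a word-internal nasal", rf"^b(?=.*[{NASAL}])", "m"),
    ("*ŋm > nm", r"ŋm", "nm"),
    ("nm > lm (dissimilation)", r"nm", "lm"),
]

def derive(skeleton):
    """Return every distinct stage of the derivation, input first."""
    stages = [skeleton]
    for _label, pattern, replacement in RULES:
        changed = re.sub(pattern, replacement, stages[-1])
        if changed != stages[-1]:
            stages.append(changed)
    return stages

# Skeleton of 'dream' (PS *äŋwå), vowels abstracted away.
print(" > ".join(derive("VŋwV")))
# Skeleton of 'son-in-law', assuming a stem with medial *ŋw.
print(" > ".join(derive("wVŋwV")))
```

The second derivation passes thru a stage `mVnmV`, which is the shape that a prima facie reading of Koibal манмемъ would suggest, before ending at the `mVlmV` skeleton of Kamassian malmi.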

Where would this word-internal *m < *w come from then? I suspect it has actually been there all along. For one, we already have various forms like Finnish vävy and Mator mijüh (миюгмэ) pointing to some kind of an original labial element near the stem vowel, which has already led to newer reconstructions along the lines of PU *wäŋəw(ə) rather than bare *wäŋə. [4] For two, the Samic reflexes of this word show a long-standing minor problem: they indicate Proto-Samic *vivë, with a seemingly Finnic-like development *ŋ >> *v. I would suggest that this issue is due to incorrect segment alignment: that Samic *v does not continue the original 2nd-syllable *ŋ, but instead the 3rd-syllable *w, and original *ŋ has been instead lost to a vocalization process of some sort. If correct, this would show direct evidence for a reconstruction *wäŋəwə (i.e. ruling out anything like **wäŋü with a labial vowel in PU already), making the PU shape of the word actually a relatively good fit at least for the consonant skeleton of Kamassian and Koibal. I could even suggest reconstructing for PU a morphophonologically alternating paradigm, with a vowel stem *wäŋwə- (> Samic, Km–Kb) : consonant stem *wäŋəw- (> Finnic, Nenets, Nganasan etc.); though this is motivated also by some other considerations that would take us fairly well afield from the current topic.

There is definitely still room for skepticism about this however, and in particular the vowel correspondences continue to be quite irregular: in the first syllable, none of PU *ä, PS *i and Kamass–Koibal a regularly corresponds to each other, while in the 2nd syllable, Km. -i ~ Kb. -e most typically continues PS / PU *-ä, not *-ə.

So far I have not started any systematic investigation of the entirely unetymologized Kamassian and/or Koibal vocabulary remaining. However, for closing, one simple observation on this front: kuro- ‘to be angry’ (in both Km. and Kb.) probably continues PU *kurə ‘anger’.

[1] i.e. native Samoyedic words, Turkic and Mongolic loanwords, and all vocabulary of unknown origin.
[2] Janurik, Tamás. 2012. Volt-e a déli-szamojéd (PSS) alapnyelv? Per Urales ad Orientem. Iter polyphonicum multilingue: 145–162.
[3] A further complication still is the potential Mator cognate / reflex: géihe in Pallas, кига in Müller, per Helimski suggesting PS *-jk- rather than plain *-k-. However the precedent of PS *äjmä ‘needle’ > Kamassian ńīmi ~ Koibal неме would maybe then seem to predict ˣšīgə for ‘cuckoo’, and we are right back in not knowing which way irregular correspondences in iconic or onomatopoetic vocabulary should be interpreted.
[4] This final *-w(ə) is strictly speaking not segmentable, but it is probably originally the same formant as also in two other in-law terms: PU *käləw(ə) ? ‘sister-in-law’ and *nataw(ə) ? ‘brother-in-law’.

Analogy Is Not Phonology

While my blogging here has been firmly within historical linguistics, every once in a while I do go poking around self-styled formal linguistics blogs too. [1] This tends to be a frustrating exercise though. By now, supposedly deep problems discussed around such parts tend to strike me as, frankly, dumb questions that only exist due to particular “theoretical commitments”, and which could be trivially resolved or avoided within better-grounded frameworks of understanding language. People stuck in generativist bubbles in particular, however, seem to be often unaware that any other types of approaches would exist at all.

As I’m rather more informed about the ground-level facts of phonology than e.g. syntax, this is going to be the more profitable area for me to comment on in any real detail, though generative syntax has also struck me as having foundational flaws roughly analogous to the foundational flaws of generative phonology. (I presume open-minded syntacticians should be even able to figure out these, ahem, analogies themselves, without me having to do all their work for them.)

At any rate, a good majority of questions attracting protracted debate in phonological theory that I have seen are immediately solved under the traditional non-generativist approach: “phonological processes” or “deep structures” do not exist as such. They are only grammatographical shorthand; rules of thumb, not rules of Grammar. [2] Where non-allophonic “phonological alternations” actually exist is within the lexicon, not within phonology.

A standard counterargument to this seems to be the fairly simple observation that loads of obviously non-allophonic alternations are, in fact, still productive to this or that extent in loads of languages. Checkmate, lexicalists?

No, of course not. This simply shows a particularly pernicious systematic failure of generative linguistics: a lack of understanding of language change, particularly that language change, including linguistic creativity, does not take place solely inside a box of “Grammar”, but also within the lexicon. Phonological alternations are easy to approach in this fashion, as they are generally not actually productive in the sense of immediate, universal applicability (as they say in Generativistland, they can be opaque). Moreover quite typically they are “productive in spats”, creating new forms one by one, now and then in the speech of particular speakers, not everywhere constantly. And the range of applicability for any one process is very finite really: while everyone creates novel noun phrases practically daily, I would wager that most people do not create any entirely novel strong verb forms over their entire life. [3] In historical and historically-informed linguistics, our default assumption is to attribute these kinds of changes to the process of lexical analogy, and understanding it is vital to understanding patterns that arise and exist in language.

What we can actually observe is that any arbitrarily deep alternations can indeed inspire the coinage of new instances of the same, and therefore they can remain “productive”. If desired, I can readily coin all sorts of cases like long ∶ length ∷ oblong ∶ oblength or sing ∶ sung ∷ wing ∶ wung. But then nothing stops me from creating folk-etymological examples either, say choose ∶ choice ∷ snooze ∶ snoice. These also fade organically into snowcloneish blends, e.g. thanks, ants ∶ thants ∷ hello, horses ∶ hellorses; spelling pronunciations, e.g. tentacles ∶ /ˈtɛntəkəlz/ ∷ Pericles ∶ /ˈpɛɹɪkəlz/; or (mis)etymological nativization, e.g. English wrong ∶ Swedish vrång ∷ En. to wring ∶ Sw. vringa. Crucially, what needs to be noted is that this is an extralinguistic cognitive skill that should not have any bearing on the development of purely linguistic theory. Already etymological nativization refuses to respect the confines of a single language, and I think most theories of mental grammar would likewise not attempt to account for spelling pronunciations. We can also easily advance loads of more or less formal analogies in areas that have nothing to do with language, from mathematics (2 ∶ 20 ∷ 5 ∶ 50; square ∶ cube ∷ triangle ∶ tetrahedron) to the natural world (nitrogen ∶ ammonia ∷ oxygen ∶ water; the Congo ∶ leopards ∷ the Amazon ∶ jaguars) and human society (evolution ∶ Darwin ∷ relativity ∶ Einstein; punks ∶ pop punk ∷ ravers ∶ happy hardcore). This, I think, demonstrates beyond reasonable doubt that analogy in fact is a general skill that humans possess, and hence there’s no point in trying to reduce its applications in language into some kind of specifically linguistic primitives.

(Note BTW that while all my examples above are phrased as classic proportional analogies, this also should not be assumed to be the only possible or even the main mechanism of analogy.)
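To make the point about how little machinery a proportional analogy needs, here is a minimal sketch of solving a ∶ b ∷ c ∶ X on plain spellings. It is an illustration only: the function name `solve_proportion` is made up, it only handles word-final replacement, and it works on orthography rather than on sound, which is a deliberate simplification and not a model of what speakers do.

```python
def solve_proportion(a, b, c):
    """Solve a : b :: c : X by swapping word-final material.

    Returns None when c does not end in the part of `a` that `b` replaces.
    """
    # Length of the beginning shared by a and b.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    old_tail, new_tail = a[i:], b[i:]
    if not c.endswith(old_tail):
        return None
    # Strip the old tail from c and attach the new one.
    return c[: len(c) - len(old_tail)] + new_tail

print(solve_proportion("long", "length", "oblong"))
print(solve_proportion("sing", "sung", "wing"))
print(solve_proportion("choose", "choice", "snooze"))
```

The first two calls give oblength and wung. The third returns `None`, since in spelling snooze does not end in -ose; that analogy runs on pronunciation, which this orthographic sketch does not see.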

Once we accept the existence of analogy as an explanation for some cases of morphophonological productivity, this provides also a direct path into rich gains in parsimony. My linguistic examples above have been chosen to be on the “clever” side, i.e. building on only marginal precedents, partly to be sure that they’re indeed novel (at minimum to me!), partly to make it seem more convincing that they should not be modelled by inserting additional epicycles into English (morpho)phonology. But the mechanism of analogy works perfectly well also on any pedestrian phonological alternations out there. What is, say, the plural of oblength? It’s clearly oblengths — but then we could model this conclusion as having been drawn purely on the analogy of lengths, or also tenths, shibboleths, Beths, etc., without needing to assume any distinct, exclusively linguistic machinery behind this. The putative outcome oblengthes, just like also morphologically clearly different options like oblengthim or oblengtha, can be predicted to be unlikely already due to the lack of bases of analogy that could lead to them. [4] That all sorts of other coinages also follow the same pattern could be likewise explained already by the extremely strong precedent for the English plural marker to be -s. In principle even the regular phonologically conditioned allomorphy between -s and -es could then turn out to be simply emergent within the English lexicon, if we enrich it with sufficiently many plural forms stored as lexemes. This approach allows cutting out a hefty amount of costly theoretical complexity assigned to phonology in theories that fail to recognize that analogy exists.
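The idea that plural allomorphy could be emergent from stored forms can be sketched as a nearest-neighbor lookup. Everything here is an assumption made for illustration: the toy `LEXICON`, the use of spelling instead of sound, the choice of "longest shared word ending" as the similarity measure, and the function names. It is not a claim about how speakers actually compute analogies.

```python
# Toy lexicon of stored singular/plural pairs (an assumption for illustration).
LEXICON = [
    ("dog", "dogs"),
    ("length", "lengths"),
    ("tenth", "tenths"),
    ("bus", "buses"),
    ("church", "churches"),
]

def shared_ending(a, b):
    """Number of word-final letters that a and b have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def plural_by_analogy(word, lexicon=LEXICON):
    """Copy the plural ending of the stored singular most similar to `word`."""
    # On ties, max() keeps the earliest entry in the lexicon.
    model_sg, model_pl = max(lexicon, key=lambda pair: shared_ending(word, pair[0]))
    if not model_pl.startswith(model_sg):
        raise ValueError("model plural is not singular + ending")
    return word + model_pl[len(model_sg):]

print(plural_by_analogy("oblength"))
print(plural_by_analogy("walrus"))
```

With this lexicon, oblength picks length as its model and comes out as oblengths, while walrus picks bus and comes out as walruses, with no rule for choosing between -s and -es stated anywhere.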

Spending one further moment within philosophy of science, there is certainly also an apparent countercost of presuming the existence of some words like lengths as separate from length (or sung from sing, etc.). However, given that lexicons already indisputably exist, and contain many, many thousands of items anyway (and that, given the phenomenon of suppletion, these indisputably can be syntactically specified as particular inflected forms, etc.), just a few hundreds more to “seed” it with generators of morphophonology should be unambiguously considered the superior solution. Extra stuff is free.

It would be indeed possible to go further still and to propose that e.g. even the realization of oblengths as specifically /ɒblɛŋθs/ with /-s/ (and not /-z/) will be inferred by analogy from other English plural forms. It’s hard to rule out that this could be the case for some people. [5] But I do grant that this at least is not an approach that could be fully generalized. Analogy generally allows for multiple solutions, some of them perhaps much less probable but still possible (e.g. if we take a cube as a prism with a square base, not as a polyhedron entirely made of squares, then the triangle analogue will be a triangular prism, not a tetrahedron; and maybe it should be heorses /hɛəɹsɪz/ rather than hellorses). Allophony by contrast is, by all appearances, subconscious enough that speakers find it difficult to create or perceive forms departing from it, and it clearly calls for a different kind of cognitive machinery.

[1] That’s the {self-styled formal linguistics} blogs; what they call themselves is, apparently, just “linguistic blogs”, with the common if vaguely cultish stance that only their branch of work actually constitutes Real Linguistics.
[2] As far as I can tell, a lot of trouble indeed comes already from the failure to fully distinguish descriptive grammar from mental grammar. Much of the early history of morphology and syntax quite transparently consists of attempts to formulate rigorous definitions for concepts of traditional Greco-Latinate grammatography like “subject” or “word”, but with little attention paid to whether this even should be done: a priori there is no reason to expect mental grammar to have any building blocks at all in common with traditional descriptive grammar (much like how, say, biochemistry is not under the obligation to follow any views of Aristotelean natural philosophy). Modern theory of phonological processes indeed also looks as if it largely amounts to applying the same mistake ultimately to Pāṇini’s descriptive (morpho)phonology of Sanskrit, although the road from there to Chomsky & Halle is not clear to me.
[3] i.e. “novel to English (or German, etc.) as a whole”. E.g. (a soup has been) wung might be a new creation for me just two days ago (‘prepared without a prior plan or recipe’, if you must know), but even before checking I am certain that others have stumbled on this same territory before. Oh yes, no question about it: it’s even on Wiktionary already, with attestations going back to 1881.
[4] But, of course, not impossible. As e.g. advanced linguistics students faced with the wug test will readily demonstrate, sufficiently large numbers of contrarian smartasses will eventually end up creating any form imaginable, no matter how “ungrammatical”. Almost nothing in language is actually impossible. This is perhaps the most clearly so when a phenomenon is “impossible” (rather, unacceptable) in one language variety but business as usual in another.
[5] Definitely not for me though. As an L2 speaker whose native language has no voiced fricatives, I ended up adopting the English plural marker(s) as just /-(i)s/ back in the day, and though I can by now make conscious effort to use [z] instead, I will be still quite content to speak of [windous], [siːliŋs], [hɑusis], [tʃʰiːzis], [nɑiʋs], [dɔgs], etc…

Examples of reductive primary splits

On a whim I have started reading the Oxford Handbook of Historical Phonology. At about two and a half chapters in I have finally reached some discussion of practical questions in some detail, and the first claim to have struck me as empirically interesting is that “primary split can also reduce an inventory”.

For those not up to speed with or just not recalling the lingo (this is after all one of those terrible user-unfriendly terminological conventions along the lines of “type 1 / type 2 error”), a reminder: a “primary split” is a conditional sound change that creates a sound (or rather, phoneme) already present in a system, contrasted with a “secondary split”, a conditional sound change that creates a sound not previously present. (I would advocate for using the more descriptive terms “split with merger” and “split without merger”.) [1] They are distinct from a simple unconditional merger, or for that matter, from an unconditional non-merger. [2] Any particular change can fall under any one of these depending on the language.
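The four-way typology just described reduces to two yes/no questions: is the change conditional, and is its output already in the inventory? A minimal sketch follows, with a made-up function name and a hypothetical toy inventory; it classifies a single change against a single inventory and ignores everything else about the language.

```python
def classify_change(inventory, source, target, conditional):
    """Label the sound change source > target against a phoneme inventory."""
    if source not in inventory:
        raise ValueError(f"{source!r} is not in the inventory")
    merges = target in inventory  # does the output already exist as a phoneme?
    if conditional:
        if merges:
            return "primary split (split with merger)"
        return "secondary split (split without merger)"
    if merges:
        return "unconditional merger"
    return "unconditional non-merger"

# Hypothetical toy inventory, not any particular language.
toy = {"p", "t", "k", "s"}
print(classify_change(toy, "t", "s", conditional=True))
print(classify_change(toy, "k", "tʃ", conditional=True))
print(classify_change(toy, "s", "t", conditional=False))
print(classify_change(toy, "s", "h", conditional=False))
```

The same change, say *t* > *s* before *i*, moves from the first category to the second as soon as *s* is removed from the inventory passed in, which is the sense in which any particular change can fall under any of the labels depending on the language.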

My first thought was of cases such as the fate of the labiovelar stops in Greek. Depending on their environment, these are reflected as any of the three “basic” stops (e.g. *kʷ > /p/, /t/, /k/; similarly for Proto-Greek *kʷʰ and *gʷ), and hence they ultimately disappear from the phoneme inventory. This kind of a situation does not seem to really show that primary split could eliminate a segment from a language’s inventory, though. Although any one of the changes could be stated conditionally, in reality one of the three changes must be the most recent chronologically — and at this point this change is then no longer a conditional sound change, but simply an unconditional merger. (I believe this status belongs to *kʷ > /p/. [3]) A similar sleight-of-hand could be really pulled whenever a sound eventually develops into multiple different reflexes: phonological inventories only offer a finite number of relevant environments, and even if there in fact is a default reflex, it can be also stated in terms of a set of particular environments. E.g. the development of PIE labiovelars in Indo-Iranian or Slavic could be stated roughly as “palatals before front vowels, velars before consonants and non-front vowels”, although only the former development is conditional, the latter instead unconditional; and indeed even feeding into, rather than independent of, palatalization before front vowels. (I.e. *Kʷ >> *Č / _E is, properly speaking, not a sound change but a sound correspondence, consisting of (at least) two sound changes: *Kʷ > *K followed by *K > *Č / _E.)

But it turns out the claim is actually something simpler. Proposed in a 2012 article “Primary split revisited” by Robert Blust, the idea is instead: if the segments involved are subject to positional constraints, afterwards it may be possible to analyze either one of them as now being an allophone of something else. (He also passingly considers exactly the same example of labiovelars in Greek, with citation to a 2000 textbook by Sihler, but without noticing the flaw from chronology.) So the actual sounds involved do not disappear from a language’s phonology; they merely now end up in a complementary distribution, and the number of phonemes can be argued to have fallen. Certainly this should be possible.

Curiously, Blust presents this analysis as only a theoretical exercise, and ends up unable to propose any actual examples of the phenomenon. Google also tells me that Blust’s term “reductive primary split” still finds no additional hits out there. I take it upon myself to therefore offer a few examples.

1. Loss of *ŋ in Proto-Finnic

Proto-Uralic had an inventory of four nasals, *m *n *ń *ŋ. The Finnic branch has however reduced this inventory to just two, *m *n. [4] The fate of the palatalized nasal *ń has been simple, merging into *n (I believe with some vowel-coloring effects word-medially; but this is tangential to the point). The fate of the velar nasal *ŋ is more diverse. The most typical intervocalic reflexes are zero (with lengthening of the preceding vowel) and *v, presumably thru earlier *w; in consonant clusters, *Cŋ > *Cv, *ŋC > *uC, both presumably again thru *w. I would additionally posit an even earlier intermediate stage *ɰ behind both the zero and *w reflexes.

One exception to all this is found: the cluster *ŋk, surviving phonetically intact into Proto-Finnic and indeed into the modern Finnic languages. Phonologically looking, however, it would seem that there has been a change here as well. *[ŋ] cannot be reconstructed for Proto-Finnic in any other environment, and hence we now have reason to interpret [ŋk] as /nk/ (or if we really wanted, /mk/, or even /Nk/ with a neutralized placeless coda nasal). Thus the splits-with-merger *ŋ > ∅, *ŋ > *w and/or *ŋ > *ɰ have been reductive: even though they leave some instances of [ŋ] unscathed, */ŋ/ as a contrastive phoneme is still lost. All of this has been already noted at least as early as by Posti (1953).

This reductive primary split also in fact functions somewhat differently from Blust’s toy example. He suggests an example of a language contrasting /t/ and /s/ only before /i/, showing elsewhere only [t]; if, then, [ti] shifts to [si], the result will be the loss of this contrast — thus yielding /ti/ rather than /si/. In Finnic, it is however not the contrast *ŋ | *w that ends up lost; and what allows the final phonological reanalysis is not the earlier distribution of either of these consonants, [5] but rather the limited distribution of the “third wheel” consonant *n, which earlier did not occur in the position before *k.
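Blust’s toy scenario can be checked by brute force on a small word list. This is a sketch under stated assumptions: the five “words” are invented, each character is treated as one segment, only the immediately following segment counts as the environment, and the function names are made up for this example.

```python
def following_contexts(words, phone):
    """Collect the segments directly following `phone` ('#' = word end)."""
    contexts = set()
    for word in words:
        for i, seg in enumerate(word):
            if seg == phone:
                contexts.add(word[i + 1] if i + 1 < len(word) else "#")
    return contexts

def contrast_environments(words, a, b):
    """Environments where both phones occur; empty = complementary distribution."""
    return following_contexts(words, a) & following_contexts(words, b)

# Invented toy lexicon: [t] and [s] both occur before i, elsewhere only [t].
before = ["ti", "si", "ta", "tu", "ato"]
# The conditional change [ti] > [si].
after = [w.replace("ti", "si") for w in before]

print(contrast_environments(before, "t", "s"))
print(contrast_environments(after, "t", "s"))
```

Before the change the two phones share the environment before *i*, so they contrast. After it the shared set is empty: [s] is found only before *i* and [t] only elsewhere, so the two can be analyzed as allophones of one phoneme even though both sounds are still present.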

2. Loss of *ɣ in Proto-Ugric

A reductive primary split that does function similarly to Blust’s example might be also found in Uralic. The current conventional reconstruction of Proto-Uralic includes a rare consonant *x, occurring only intervocalically (and when followed by 2nd-syllable *ə, though this proves to be inessential to the point). Its reflexes across Uralic point towards a velar obstruent of some sort, though it does seem to have been distinct from *k, the other, well-established velar obstruent. We also find that the reflexes of *x and intervocalic *k indeed coincide to a large extent across Uralic. In some cases, the reflexes are inconveniently either zero (thus Permic, Mari, Samoyedic) or merged with something else still (thus Mordvinic). Here we cannot clearly rule out the option that it is *x that is first unconditionally lost or merged, followed by *k along the same trajectory only later. A merger to a distinct velar reflex *ɣ can be however found in the two Ob-Ugric language groups, Mansi and Khanty. The third Ugric language, Hungarian, has been proposed to also have passed thru a similar stage. If we suppose *[ɣ] was indeed the original sound value of “*x”, we would seem to have here exactly Blust’s situation: the contrast *ɣ | *k originally occurred only intervocalically, and therefore the result of an intervocalic lenition of *k to [ɣ] will be counterintuitively phonemically */k/, not */ɣ/.

This situation would have been quite temporary, though: in all three Ugric groups, degemination of *kk probably introduced quite soon a new medial *k, leaving *ɣ again (?) a contrastive segment of limited distribution. At least the apparently parallel degemination of *pp, *tt however cannot be reconstructed for Proto-Ugric: it must be preceded by the lenition of medial *p and *t in Hungarian, which yield modern v, z, presumably thru intermediate *b, *d > *β, *ð. In Ob-Ugric by contrast *p, *t remain without lenition. Thus probably also *kk still remained in Proto-Ugric; and in any case degemination of *kk must postdate at least the lenition of single *k.

Of course a worse problem still is that the analysis depends on not particularly certain details of PU reconstruction. If *x was not *[ɣ] but something else, like *[x] or *[q], it would have been possible that it is intervocalic *k that first lenites to *[ɣ], after which *x simply unconditionally merges with this new allophone of *k.

But we can perhaps try again:

3. Loss of *ɣ⁽ʷ⁾ in Khanty varieties

The saga of *ɣ continues further in Khanty, with some rather similar development as in the previous case. From here on, the contrast with *k seems to be generally maintained (though we do find both of them giving /χ/ as a conditional reflex in Western Khanty). Instead it is the contrast *w | *ɣ that trends towards neutralization. One example could be found in Eastern Khanty, where intervocalic *w develops to *ɣʷ; and *ɣ splits, at least in the Surgut dialect group, to [ɣ] ~ [ɣʷ], the latter following most (but not all) Proto-Khanty labial vowels. We have some reason to consider the latter change older than the former: it is shared also with Western Khanty (with further *ɣʷ > /w/) and it could be reconstructed as an allophonic change already for Proto-Khanty. If so, *w > *ɣʷ in Proto-Eastern Khanty would be a reductive primary split: its result will be that *[ɣ] ~ *[ɣʷ] are now in a complementary distribution with word-initial *[w], and therefore they can be considered allophones of a single phoneme.

This situation, however, is not reflected as such in either of the two main branches of Eastern Khanty. In Surgut Khanty, mergers such as *ü > /i/ have now left /ɣ/ distinct from /w/ (= [w] ~ [ɣʷ]); in Far Eastern (Vakh-Vasjugan) Khanty, medial *p has been lenited to a new [w], while my proposed intermediate *[ɣʷ] has lost its labialization, likewise leaving /ɣ/ a clearly distinct phoneme from /w/. It would be also possible to suppose that *[ɣʷ] actually occurred in Proto-Eastern Khanty only as a medial allophone of /w/, and later [ɣʷ] as a reflex of *ɣ is an innovation of Surgut Khanty in particular, perhaps only at most areally connected with Western Khanty. Something like this is indeed suggested by Proto-Khanty roots of the shape *PÜɣ- (with a bilabial initial and a front rounded vowel) — these give /PIɣʷ-/ as expected in Surgut Khanty, but in several varieties of Western Khanty (Southern Khanty, transitional South/North dialects of Nizjam and Šerkaly), cheshirization to *ɣʷ > *w either fails to take place or is reverted, giving instead /PIɣ-/. Similarly, Proto-Khanty *-ăɣ- (probably with *ă being labial [ɒ̆]) gives Surgut /-ăɣʷ-/ but Western *-oχ-. Here too we don’t seem to have much evidence of a common Proto-Khanty development to *ɣʷ, and we should probably assume a separate labialization in Surgut (though something like *ɣʷ > *ʁʷ > *χʷ > /χ/ is at least theoretically conceivable).

The traditional reconstruction of Proto-Khanty (see e.g. Honti 1999: 75–77) actually goes even further and does not recognize distinct medial *w at all. In such a system, *ɣ would appear to have been an allophone of /w/ already at this point. This though implies positing a conditional merger *w > *[ɣ] already between Proto-Ugric and Proto-Khanty, which itself will be then a reductive primary split. But I do find it preferable to assume that Proto-Uralic and Proto-Ugric *w was simply maintained as distinct all along in Western Khanty, especially since it seems to be possible to identify minimal pairs; one is Southern /sŏw/ < *sŏw ‘pole’ vs. /sŏχ/ < *sŏɣ ‘skin’.


I could probably think of several further examples of reductive primary splits in various languages — these have simply been the first three examples to come to my mind straight away. I can easily agree with Blust that perhaps this theoretical possibility has gone so far unrecognized due to an overreliance on just a few canonical examples mostly from Indo-European in discussions of the typology of sound change.


Honti, László. 1997. Az ugor hangtörténethez. Az ugor alapnyelv kérdéséhez: 31–39. Budapest.
Honti, László. 1999. Az obi-ugor konszonantizmus története. Szeged.
Posti, Lauri. 1953. From Pre-Finnic to Late Proto-Finnic. Finnisch-Ugrische Forschungen 31: 1–91.

[1] The “primary” / “secondary” terminology moreover seems to me to be kind of backwards. “Primary” splits appear to be unnecessary to assume as a separate phenomenon on the phonetic level at all, since it seems to me they can be always modelled as a series of two sound changes: a “phonetic secondary split”, followed by an unconditional merger of the newly created allophone.
[2] I have not seen this fourth option identified often, but it seems to be appropriate for any sufficiently advanced “phonetic drift”, taking a segment so far off-field that it cannot be identified anymore with its original phonological value. E.g. although no conditioning or merger has been necessarily involved, it would seem to be not at all appropriate to characterize West and Northwest European [ʁ] or [ʕ] as either a trill or a coronal, despite its origin from earlier /r/ (of course, meaning those varieties where a guttural fricative is the typical realization; when we do still find intermediate [ʀ], we could at least argue that this is the target realization and [ʁ] realizations are merely speech errors).
[3] The bilabials /p pʰ b/ are the result before a consonant, as well as before the “noncoloring” vowel /a/ and the “weakly labial” vowels /o ɔ/, i.e. environments where there is not much motivation for a conditional development. The dentals /t tʰ d/ are instead triggered by following front vowels /i e ɛ/ (assimilation), the velars /k kʰ g/ by a following or preceding close labial /u/ (dissimilation).
[4] The /ŋ/ encountered in modern Finnish is a later development, primarily by consonant gradation from *ŋk, later reinforced by loanwords. The native origin still leaves the interesting trace that singleton ˣ/-ŋ-/ remains foreign; only the geminate /-ŋː-/ is found intervocalically. Similarly, /nʲ/ in modern Eastern Finnish, Karelian, Veps etc. arises by secondary palatalization, most widely thru apocope of *i.
[5] *ŋ does have a more-limited-than-average distribution in PU, being barred from the word-initial position. However, nothing in the analysis would change if we assumed that there did exist a word-initial *ŋ- that likewise changed to *w > *v in PF.

Some Recent Vogulology

(By current standards this perhaps should be “Mansilogy” or “Mansi Studies”, but “Vogulology” just has a good sound to my ear.)

1. Word-final vowels

This summer has seen the publication of the Festschrift Ёмас сымыӈ нэ̄кве во̄ртур э̄тпост самын патум [1] dedicated to our (i.e. of Finno-Ugric Studies at University of Helsinki [2]) professor Ulla-Maija Forsberg / née Kulonen. This includes my paper “Notes on Proto-Mansi word-final vocalism”, where I mostly focus on the somewhat elusive category of Proto-Mansi *ə-stems. These can be consistently and directly distinguished from plain consonant stems only in 18th century Mansi records from assorted southerly dialects, but I argue that their former existence nonetheless leaves indirect evidence in a fairly large number of places.

  • They condition / phonemicize the rise of the new vowel length split in Central Mansi (as first recognized by Mikhail Zhivlov): originally long in open syllables, short in closed syllables, thus *CVCə > /CVːC/, but *CVC > /CVC/.
  • Coda spirantization of *k in Central Mansi takes place already before apocope: *CV(C)kə > /CV(C)k/, but *CV(C)k > /CV(C)x/, and *CVkCə > /CVx[ə]C/, but *CVkC > /CVːk[ə]C/ (probably *CVk[ə]C already to begin with).
  • Nasal cluster simplification also takes place already before apocope: in Southern and Central Mansi *CVNTə > /CVNT/, but *CVNT > /CVT/, affecting all nasal+obstruent clusters (in Southern further *CVŋkə > *CVŋk > /CVŋ/); in Northern Mansi only *CVNF > /CVF/, affecting only the nasal+fricative clusters *nč > *nš, *ńć > *ńś, and (though I ended up forgetting this from the paper) *ŋq > *ŋχ.
  • Conditional retentions: Southern Mansi *CEĆə > /CEĆiː/ (i.e. *ə > /iː/ following palatal vowel + palatal consonant); possibly Northern Mansi *CU(C)Cə > /CU(C)Ci/ (i.e. *ə > /i/ following a close vowel = /u/ or /i/).
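The relative chronology in the second and third points can be sketched as a small rule cascade. This is only a toy model under loud assumptions: the input strings (kantə, kant, takə, tak) are hypothetical placeholders standing in for the schematic shapes *CVNTə, *CVNT, *CVkə, *CVk, not actual Mansi words; only word-final codas are handled; and the vowel length split is left out entirely.

```python
import re

NASALS = "mnŋ"
OBSTRUENTS = "ptkqsš"


def simplify_nasal_cluster(word):
    """Word-final nasal + obstruent loses the nasal: *CVNT > CVT."""
    return re.sub(f"[{NASALS}]([{OBSTRUENTS}])$", r"\1", word)


def spirantize_coda_k(word):
    """Word-final *k > x (Central Mansi type): *CVk > CVx."""
    return re.sub(r"k$", "x", word)


def apocope(word):
    """Loss of word-final *ə."""
    return re.sub(r"ə$", "", word)


def derive(word, rules):
    """Apply the rules to a word in the given order."""
    for rule in rules:
        word = rule(word)
    return word


# The ordering argued for in the text: coda changes first, apocope last.
BEFORE_APOCOPE = [simplify_nasal_cluster, spirantize_coda_k, apocope]
# The counterfactual ordering, shown only for contrast.
AFTER_APOCOPE = [apocope, simplify_nasal_cluster, spirantize_coda_k]

if __name__ == "__main__":
    for form in ("kantə", "kant", "takə", "tak"):
        print(form, derive(form, BEFORE_APOCOPE), derive(form, AFTER_APOCOPE))
```

With apocope ordered last, the hypothetical *ə-stems keep their cluster or their stop (kantə > kant, takə > tak) while the plain consonant stems change (kant > kat, tak > tax). With apocope ordered first, both stem types fall together, which is exactly the distinction the indirect evidence relies on.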

There are some complications to the first three lines of evidence, since they only affect / happen before coda consonants. They therefore create new morphological alternations in inflected stems, such as nom.sg. *pōt ‘pot’ : nom.pl. *pōt-ət ‘pots’ [3] >> Western Mansi /put/ : /puːtət/. Later on, these alternations have often been levelled out in favor of one “grade” or the other in individual dialects. This is probably the reason for occasional apparent irregularities such as *kōnt ‘backpack’ > Pelym /kunt/ (rather than expected ˣ/kut/), although I have not combed for them in detail. — This would really require also a discussion of the same changes in verb inflection and word derivation, where they can arise also depending on consonant-initial versus consonant-final suffixes. At least //NF// simplification in Northern Mansi is well-described in standard references already (e.g. Keresztes’ 1998 handbook description mentions the examples ľuuńś-i ‘weeps’, suns-i ‘looks’, χaaŋχ-i ‘climbs’ : ľuuś-səm ‘I wept’, sus-səm ‘I looked’, χaaχ-səm ‘I climbed’). Eichinger’s new grammar of Western Mansi (see below) recognizes all three of vowel length alternation, //NC// simplification and x ~ k alternation, the last interestingly in an inverted form from the historical derivation: stem-final //x// → k before a vowel-initial suffix. For the rest I would need to look up a variety of sources and see how much of this they recognize.

There would be also some implications whose discussion I have left for later work altogether. E.g. the loss of *-ə appears to leave Central Mansi /x/ in fact marginally phonemic in all varieties. However, it has been treated as only a free variant in some (chiefly Hungarian) works. The most notable offender is the UEW, where e.g. Pelym /kulx/ ([kuləx]) ‘raven’ is given as “kulk”; /ńoxʷs/ ‘sable’ is given as “ńoks”; /püxń/ ([püxəń]) ‘navel’ is given as “pükəń” (note also inconsistent treatment of schwa). Thus, there is a lesson here against trying to apply overly strict methodology to the segmental phonological analysis of poorly documented language varieties. The limited corpus of Central Mansi varieties may not have allowed finding minimal pairs, but this should not be taken as grounds to ignore the distinction entirely. This problem has come up before in phonological analyses of Ob-Ugric varieties as well. Other such cases include e.g. the status of labialized velars /kʷ xʷ ŋʷ/ all across Mansi, discussed already by Kálmán (1976) [4], the short vowels /e ɶ ɤ u/ in Eastern Mansi and the open rounded vowels /ɔ œ/ in Far Eastern Khanty.

2. Archival Mansi

Julia Normanskaja has in the last few years published reports and analyses of several archival materials of Mansi in the journal Ural-Altaic Studies (now added to my sidebar). The earliest came out in volume 19 (4/2015), covering a 1905 dictionary of the Pelym dialect as well as new 2013 field records on the Middle Ob and Jukonda dialects — the latter perhaps the last records of Eastern Mansi, collected from two recently found elderly speakers. Instead of integrating these with the established framework of Mansi historical phonology though, she has opted to compare them only with the Sosva-based Northern Mansi written standard, ending up with a very reduced seven-vowel reconstruction of Proto-Mansi or maybe rather Proto-Non-Southern Mansi (“core Mansi” as I have called it) that unfortunately doesn’t seem to be very functional for anything else. A follow-up article in volume 26 (3/2017) explores the Pelym material a bit more, but it does not turn out to show any previously unknown features at least in its phonology. Presumably it would have more value for the lexical documentation of Mansi, perhaps even for etymological research.

Two further works this year I have found more interesting. In volume 36 (1/2020) she treats an unpublished 18th century Mansi dictionary that appears to not fit within the current classification scheme of Mansi: it shows some innovations typical for Northern Mansi (*kʷä- > ко-, *aɣ > оу, *q > х) but fails to show some others (*ä, *š retained as е, ш instead of being simplified to a, s). Even a transitional development can be found in по́улъколъ ‘bathhouse’: *äɣ > оу here is distinct from the reflexes in all later dialects (S päwl-, W E päɣl-, N puwl- ‘to bathe’). Specifically non-Northern innovations do not seem to be found though, and I at least would thus simply consider this variety to represent early Northern Mansi before the rise of some more recent innovations. A brief comparison with the older Mansi materials available to me does show the same archaisms in some other early NMs records as well, e.g. *šëëtə > schat ‘100’ (later > sāt).

Most recently, volume 38 (3/2020) now treats several further 18th-century wordlists, namely some very southwestern ones from within the current-day Perm Krai, which she identifies as their own dialect group, though still affiliated with the Tavda dialect that has later on been the “type specimen” of Southern Mansi. I cannot agree with all aspects of her analysis here — e.g. graphical ‹а› for Proto-Mansi *ëë I would think most probably only reflects an inability of 18th-century Russians/Germans to distinguish [ʌː] and [ɑː] — but the overall point seems to be sound: the dialect differentiation of Mansi can be expected to have begun already in the south. A feature that does appear to constitute a shared innovation is the lowering of short *u to ‹о›, but probably this is not yet enough by itself to set up much of a common Southern Mansi dialect area covering both these and Tavda.

2.5. A *č in Proto-Mansi?

All this attention to 18th century Mansi also got me started on assembling a general overview of the data. Most of it is still not published anywhere, but Gulya’s 1960 article first noting the retention of final vowels [5] cites seemingly all available evidence for a list of about 100 words. This could already have some value for surveying in more detail the development of the Mansi dialect areas over the 18th and 19th centuries.

I can also already submit one initial observation: a few varieties seem to show an affricate, ‹ч› or ‹tsch›, corresponding to usual Proto-Mansi *š. Even more interestingly, this seems to only happen for *š deriving from Uralic *č, not for *š deriving from Uralic *ś (or *ć, as now alternately reconstructed):

  • ‘knee’: M19 ‹tschäntschi›, VTur. ‹ча(н)чи›, SSo. ‹Tschândsche-›
    — cf. Khanty *čäänč;
  • ‘town’: VTur. ‹оча›, SSo. ‹ootsche› (M19 ‹óscha›)
    — cf. Khanty *waač;
  • ‘100’: M19 ‹schäta›, VTur. ‹шата›, SSo. ‹Schôtt›, ‹Schätte›
    — cf. Khanty *saat;
  • ‘heart’: M19 ‹schìima›, VTur. ‹шимъ›, SSo. ‹Schinn› [sic]
    — cf. Khanty *säm.

I would think that this is therefore an archaism: Proto-Mansi had both *č and *š, retained in these three varieties [6] but merged as *š in the others. This of course makes me particularly interested in getting my hands on fuller versions of these three sources in particular and seeing if the pattern keeps up.
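The pattern in the four comparisons above can be tabulated mechanically. The following is a minimal sketch, assuming that the spellings ‹tsch› and ‹ч› are the only ones that mark an affricate and that the presence of *č in the Khanty cognate can stand in for Uralic *č; the data layout and function names are illustrative, not taken from any of the sources.

```python
import unicodedata

# Spellings taken to mark an affricate (an assumption of this sketch).
AFFRICATE_SPELLINGS = ("tsch", "ч")

# Forms and Khanty comparanda exactly as listed in the text.
RECORDS = [
    {"gloss": "knee", "khanty": "*čäänč",
     "forms": {"M19": ["tschäntschi"], "VTur.": ["ча(н)чи"], "SSo.": ["Tschândsche-"]}},
    {"gloss": "town", "khanty": "*waač",
     "forms": {"M19": ["óscha"], "VTur.": ["оча"], "SSo.": ["ootsche"]}},
    {"gloss": "100", "khanty": "*saat",
     "forms": {"M19": ["schäta"], "VTur.": ["шата"], "SSo.": ["Schôtt", "Schätte"]}},
    {"gloss": "heart", "khanty": "*säm",
     "forms": {"M19": ["schìima"], "VTur.": ["шимъ"], "SSo.": ["Schinn"]}},
]


def nfc(text):
    """Normalize so that precomposed and decomposed letters compare equal."""
    return unicodedata.normalize("NFC", text)


def shows_affricate(form):
    """True if the spelling contains one of the affricate graphemes."""
    lowered = nfc(form).lower()
    return any(spelling in lowered for spelling in AFFRICATE_SPELLINGS)


def affricate_pattern(records):
    """Map gloss -> (Khanty cognate has *č, varieties spelling an affricate)."""
    result = {}
    for record in records:
        varieties = [
            variety
            for variety, forms in record["forms"].items()
            if any(shows_affricate(form) for form in forms)
        ]
        result[record["gloss"]] = (nfc("č") in nfc(record["khanty"]), varieties)
    return result


if __name__ == "__main__":
    for gloss, (khanty_has_c, varieties) in affricate_pattern(RECORDS).items():
        print(gloss, khanty_has_c, varieties)
```

On these four items the affricate spellings turn up only where Khanty has *č: in all three varieties for ‘knee’, and in VTur. and SSo. for ‘town’, where M19 ‹óscha› is the one form without an affricate. The same check could be rerun unchanged on a fuller wordlist.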

3. Three Western Mansi Grammars

I recently discovered also Victoria Eichinger’s PhD thesis “Westmansisch anhand der Textsammlungen von Munkácsi und Kannisto” from 2017. As per the title, this is not an up-to-date language-documentation study but instead a slightly more philological analysis, based on late 19th / very early 20th century fieldwork on the language. It’s a good addition to Mansi grammaticography too, as the now-extinct western dialects have not been subject to much discussion. For an analysis of limited materials it’s fairly thorough, treating also topics that have received little attention so far, e.g. morphophonology (the still-living Northern Mansi has much less of this anyway than Western Mansi did). The organization into alternating chapters on the Pelym and Middle Lozva dialects is a bit jarring at first, but seems justifiable enough, especially given brief comparison chapters at the end of each section. The other three WMs dialects that were recorded more fragmentarily by Munkácsi and Kannisto are generally left out, not a bad option in a generally synchronic grammar. [7] (I do think at least their phonology would eventually deserve a more detailed historical analysis though than what has been done so far.) They only make a small appearance towards the end where Eichinger outlines the main morphological differences between Western and Northern Mansi, even then in a more contrastive than comparative fashion. She does regardless show that parts of a list of seven features that has been suggested to define WMs in earlier research are insufficient, and proposes an amended version.

To me it seems though that even a few of the features given by Eichinger should be still removed from consideration. Two repeating issues are retained archaisms (e.g. accusative case) or heterogeneity (e.g. replacement of the ablative with either postpositions or the lative). Also a bigger open question still might be the direction of comparison. Distinguishing Western and Northern Mansi still remains quite easy. The closest affinity of WMs is instead with Eastern Mansi, forming the Central Mansi group, and among the traditional four-way division of the Mansi varieties, it is the West / East distinction that appears to me to be mostly conventional and not that firmly established. There instead seems to be a cline of increasing innovativeness towards the west, overlaid also with contact effects from Komi and Khanty… It’s probably not necessary to assume the existence of either “Proto-Western” or “Proto-Eastern”, only a single “Proto-Central”. And if so, perhaps some different original dialect cleavage could be assumed for this instead? — At least we now have more good materials for eventually surveying this issue too.

[1] Northern Mansi: /jomas/ ‘good’, /sim-əŋ/ ‘heart-ADJ’, /neː-kʷe/ ‘woman-DIM’, /woːr-tuːr eːtpos-t/ ‘forest-lake month-LOC’ (also /eːt-pos/ ‘moon, month’ readily parses as ‘night-light’), /sam-ən pat-əm/ ‘eye-LAT begin-PTCP’, altogether: “Goodhearted girl born in August”.
[2] She is currently posted instead as the head of the Institute for Languages of Finland though, and has earlier spent quite a while also as the vice-rector of the university. I was happy to catch some of her Mansi courses taught between these some years ago however.
[3] Before anyone wonders in the comments: yes, these might be cognates, depending on how much you like explanations like deriving Northwest Germanic *pottaz from an unattested early Samic reflex of PU *pata (expected PS **pōtē > common Western Sami **puohtē) or anything going back to Indo-Uralic. No loan etymology from IE into all across Uralic seems to be possible though.
[4] Kálmán, Béla. 1976. “Van-e a labio-palatoveláris mássalhangzó-fonéma a vogulban?” — Nyelvtudományi Közlemények 78: 359–363.
[5] Gulya, János. 1960. “A manysi nyelv szóvégi magánhangzóinak történetéhez”. — Nyelvtudományi Közlemények 62: 33–50.
[6] VTur. = Verkhoturye, SSo. = Southern Sosva (Gulya’s “DSzo.”). As far as I can tell, he does not explicitly explain his abbreviation “M19” anywhere, but I think it might mean an unlabeled source, thus microfilm #19 out of the 24 wordlists his paper covers.
[7] The work does clue me in that also a similar Master’s thesis on the Northern Vagilsk dialect was prepared by Eichinger’s project colleague Anna Wolfauer.

First-syllable *ə in Proto-Mordvinic?

The following is, currently, more of a hypothesis I wish to record than an actual result.

Out of the two Mordvinic languages, Erzya shows the simple vowel inventory /i e a o u/ (plus a recent marginal /ɨ/ phonemicized by Russian loanwords). Moksha adds to this firstly an open front vowel /ä/, but also a reduced vowel /ə/ with front and back allophones. In noninitial syllables this corresponds to vowel-harmonic /e ~ o/ in Erzya, or in some dialects instead /i ~ u/. There are two main reconstructions of the Proto-Mordvinic situation: the Finnish/Hungarian approach, which posits Moksha-like original *ə, and the Russian approach, which posits Erzya-like original *i ~ *u. In terms of phonetic typology, the latter seems simpler from the Mordvinic dialectology viewpoint: *i ~ *u > /ə/ is trivial vowel reduction, while *ə > /i ~ u/ is rather less common, and also runs counter to typical vowel inventory trends in the region. [1] The former, on the other hand, seems simpler from the wider Uralic viewpoint: PMo *ə quite typically continues PU unstressed *a ~ *ä, and routing reflexes like *kota >> /kudo/ ‘house’ thru a stage *kudu with a close vowel appears unparsimonious. I have tended to follow the *ə reconstruction already since I mostly talk about Mordvinic within the Uralic context. A second motivation that appears reasonable to me comes from Erzya dialects where PMo *e *ä yield /ä e/ (minimal pair: /käď/ ‘skin’, /keď/ ‘hand’ ~ Mk. /keď/, /käď/ respectively), a “flip-flop” that seemingly demands some feature in addition to height for distinguishing these. We could posit that *e, *o were, at least phonetically, reduced vowels *ĕ, *ŏ, which would then also suggest that *ə was their unstressed neutralized allophone.

But most of this seems to be further complicated by a look at initial-syllable /ə/ in Moksha. This most typically corresponds instead to /i/ and /u/ in Erzya, including in dialects with /e ~ o/ corresponding to Mk. non-initial /ə/; sometimes we even find both close vowels represented in Erzya dialects; relatively often Uralic sources of such vocabulary would predict **e or **o; sometimes we find loss of the vowel altogether, either in just Erzya or also in Moksha dialects. A few examples:

  • Er. /kirta-/, /kurta-/ ~ Mk. /kərta-/ ‘to singe, scorch’ < PU *kor(p)-tta- (predicted PMo **kurtə-);
  • Er. /turva/ ~ Mk. /tərva/ ‘lip’ < PU *turpa (predicted PMo **torva);
  • Er. /troks/, /truks/, /turks/ ~ Mk. /tərks/, /turks/, /truks/ ‘across, thru’ < PU *tora-ksə (predicted PMo **turəks);
  • Er. /srado-/, /strado-/ ~ Mk. /səradə-/ ‘to be strewn’ < PU *sira- (predicted PMo **sora-).

Generally I’ve seen the /i/ ~ /ə/ and /u/ ~ /ə/ correspondences explained thru new secondary vowel reduction in Moksha. But this really fails to explain why we should have any doublets like /kirta-/ ~ /kurta-/ within Erzya as well. Given this and the cases of syncope, my current hypothesis is that perhaps we should be treating Moksha /ə/ as older, already Proto-Mordvinic, and the Erzya full vowels as secondary. This would obviously confirm that unstressed /i ~ u/ in Erzya also has to be secondary compared to Moksha /ə/; but this comes at a cost: it would also seem to mean that we now have some reason to suspect a contrastive Proto-Mordvinic *ə at least in the first syllable. Many, though not all, cases of such an *ə seem to be further followed by a full vowel /a/. Stress retraction onto full vowels is typical in the region, and so instead of setting up a new vowel quality contrast, a stress contrast might be possible: *tərvá = */tOrvá/ for ‘lip’, versus e.g. *tólga (= Er Mk /tolga/) ‘feather’. Non-initial stress placement like this is in fact attested from both Erzya and Moksha. — But then what of cases like ‘across’? Would we also need to set up contrasts like *təróks = */tOróks/, versus *mórə = */mórO/ ‘song’ (> Er /moro/ ~ Mk. /mor/)? Or even, since reflexes like /turks/ also occur (but not ˣ/turoks/, ˣ/təruks/ etc.), do we perhaps need to set up a syllabic *r̥ here??

All of this should be also further compared with words showing syncope in both Erzya and Moksha. If first-syllable *ə was allowed in Proto-Mordvinic, it seems quite possible to me that words like Er. /pŕa/ ~ Mk. /pŕä/ ‘end, head’ < PU *perä (predicted PMo **piŕə) should be reconstructed not just yet with an initial cluster, but rather as something like PMo *pəŕa, with syncope only incidentally taking place in both languages later on in this kind of auspicious position, i.e. where syncope would produce a typologically natural initial consonant cluster (the same environment as initial-vowel syncope in Udmurt).

[1] I would propose solving this by routing the /i ~ u/ dialects thru the mainline /e ~ o/ type: after “de-reduction” of *ə to full vowels, these dialects would have gone thru vowel reduction again, but this time not of the centering but rather inventory-reducing type: unstressed *e > /i/, *o > /u/. This is well paralleled by unstressed /e/ × /i/ > [ɪ] in Russian, which of course has been the most significant contact language of Erzya for the last several centuries already.

No mid vowel dissimilation in Greek — nor Finnish?

I recently read “Deconstructing ‘height dissimilation’ in Modern Greek” (Journal of Greek Linguistics 3, 2002) by Julián Méndez Dosuna. I don’t really dabble in Modern Greek dialectology, but this struck me as an interesting paper for its methodology regardless, and the lessons seem to apply also more widely.

The story goes: Modern Greek varieties often reflect Ancient Greek /ea eo/ as /ia io/, and while AGk /oa/ was more rare, it can be also reflected as ModGk /ua/. [1] This has traditionally been explained to have come about thru a process of height dissimilation: [mid] + [non-close] > [close] + [non-close]. JMD however argues for a different pathway. Using /ea/ for illustration, the first stage would rather have been coalescence to a diphthong /e̯a/, followed by unconditional raising of the nonclose nonsyllabic to give /ja/ (both reflexes are also attested among the palette of ModGk dialects), and finally re-breaking to /ia/. His main objection is that mid vowel dissimilation seems to be phonetically unmotivated, that explaining it as a means to prevent syllable contraction is too teleological, and that this explanation makes no sense anyway for dissimilation feeding into glide formation (which is the traditional routing of varieties showing /ja/).

I am fully on board with this kind of an approach. It is my experience that dialectologists quite often (1) operate on an assumption of deriving modern dialects directly from a classical/standard variety of the language, and (2) do not have a good knowledge of comparative linguistics besides their own subject. Because of this they can end up proposing all kinds of historically backwards and/or phonologically nonsensical reconstructions or sound changes. Two examples from elsewhere would be alleged /q/ > /ɢ/, /g/ in Arabic dialects (surely rather an earlier split with something like (*kʼ >) *k̰ˤ > *q̰ > /q/ in Classical Arabic versus *k̰ˤ > *q̰ > /ɢ/ > /g/ or *k̰ˤ > *k̰ > /g/ dialectally) [2] and alleged conditional /aɪ aʊ/ > [əɪ əʊ] in Canadian English (surely rather Early Modern English *əɪ *əʊ being positionally retained and only conditionally lowered to /aɪ aʊ/).

If this alone wasn’t enough though, JMD covers also plenty of indirect reasons to prefer a glide formation + breaking pathway. From the Greek dialect data we have the following points:

  • While mid + mid /eo/ can develop to /io/, the sequences /ee/ and /oo/ [3] do not develop to **/ie/, **/uo/, and they instead generally show contraction to simple /e/, /o/.
  • Glide formation explains concomitant stress retraction from e.g. /éa/ to /iá/ in some dialects, and also “regular hypercorrection” from e.g. /iá/ to /ía/ in others; or per JMD rather: stress advancement upon the re-breaking of /ja/ to /ia/.
  • Re-breaking explains the history of dialects where e.g. /ia/ (from earlier /ea/ or not) appears to have given /ja/ only after “palatalizable” consonants, into which the glide is then absorbed; i.e. /nia/ > *ɲja > /ɲa/, but /ðia/ remains unchanged. Per JMD, the latter rather gives intermediate *ðja as well, but reverts to bisyllabic after *ɲj > /ɲ/ coalescence has applied.
  • Also in varieties where the bisyllabic realization remains prescribed, sequences starting with a mid vowel can parse as a single syllable in poetry, and phonetic diphthongs such as [e̯a] can be observed in connected speech.

As two additional typological arguments, he notes that mid vowel dissimilation, i.e. raising only before open or non-close vowels, is not well-attested as a synchronic phonological process, and that diphthongs do show a strong cross-linguistic tendency towards fully close endpoints. [4]

I didn’t catch this point being made particularly explicitly, but also linking /e̯a/ and /ja/ diachronically together additionally seems like increased economy over the traditional assumption of two unrelated coalescence processes along the lines of /e̯a/ < /ea/ > /ia/ > /ja/.

This all naturally makes me wonder about Finnish, where mid vowel dissimilation is a classic dialect feature, applying to unstressed /ea eä oa öä/ sequences. These primarily come about following elision of earlier unstressed *-ð- and are primarily found in four morpholexical environments: adjectives in -eA; partitive singulars in -A of nominal stems in -e-, -O-; infinitives in -A of verb stems in -e-, -O-; [5] “contracted” verbs in -A- derived from stems in -e-, -O-. All of these yield /ia iä ua yä/ in a variety of dialects, maybe best known as a feature of South Ostrobothnian, but also attested further north; in a small area in the southwest; and a slightly wider area in the southeast. [6]

Let’s first take a moment to consider if in Finnish, too, the /ia/ type reflexes could have actually followed the same /e.a/ > /e̯a/ > /ja/ > /i.a/ trajectory that JMD argues for modern Greek. Just as in Greek, the intermediates could be partly attested: /ja/ is known from a few southwestern and SOstrobothnian varieties, and some eastern varieties show /ea̯/, trivially close to more hypothetical *e̯a. (These are generally dialects that also show /oa̯/ and /eä̯/ for earlier unstressed *aa and *ää, and in principle one could propose that /eä oa/ actually first assimilate to *ää *aa; but for /ea̯ öä̯/ an explanation like this isn’t possible.) It is also the case that /jV/ > /iV/ under some particular conditions is a widely-distributed sound change in Finnish, e.g. /vjV/ > /viV/ in kavia ~ kavio ‘hoof’ < kavja ~ kavjo < ⁽*⁾kapja. I already think this might apply also in more cases than has usually been realized, and perhaps we could go further still and even assume developments such as korkea > korkja > korkia ‘tall’. Three-consonant clusters like /rkj/ would be rather strange to most Finnish dialects however.

There are also some adjective doublets that could be taken to suggest /eA/ > /iA/ > /jA/. Directly attested are at least eheä ~ ehjä ‘whole’, norea ~ norja ‘pliable’, sorea ~ sorja ‘beautiful’. Similar alternation can be reconstructed also behind at least lakea ~ laaja (< *laɣja < *lakja) ‘wide’ and välkeä (← *väleä by suffix exchange) ~ väljä ‘loose’. I am far from certain though about explaining these as phonological doublets. The variants in /-jA/ can be found also in dialects where the soundlawful development is /eA/ > /ee/, e.g. ehjä is found all across Tavastian dialects, and penetrates fairly well into Savonian dialects as well. In at least two cases this alternation even appears just within Karelian, where there is no sign of *eA > ˣ/iA/: kahei (Livvi) ~ kahja ‘coarse, rough’, karie ~ karja ‘coarse, big’. The latter indeed seems to be a specialization of Proto-Finnic *karja ‘cattle; multitude’, i.e. not a secondary development from a **kareda > **karea. My working hypothesis remains that this is mostly a kind of phonetically motivated morphological analogy, and that the forms in /-CjA/ are generally more original.

A final problem is that unlike Greek, Finnish has also original /-CjV/, as in *karja above. Some development to /-CiV/ can be found, but not in all cases. E.g. in SOstrobothnian /-ljV/, /-rjV/ > /-liV/, /-riV/ is regular, but /-hjV/ rather receives an echo vowel, e.g. pohja > pohoja ‘bottom; north’, tyhjä > tyhyjä ‘empty’, clearly distinct from e.g. kauhea > kauhia ‘terrible’.

So a coalescence + re-breaking hypothesis runs into a variety of trouble in Finnish. I still would not want to just abandon the argument about vowel height dissimilation being an unnatural sound change though. Another way to fix the situation is possible too: glide epenthesis, followed by raising conditioned by this new glide (a mechanism that JMD passingly reports from Dutch). Thus, I would propose /ea eä oa öä/ > (? [ee̯a ee̯ä oo̯a öö̯ä] >) /eja ejä owa öɥä/ > /ija ijä uwa yɥä/ > /ia iä ua yä/. This has the same benefits of better typological plausibility, and no major problems with intermediate stages. The intermediate /eja ova/ type is again attested, conveniently neighboring both the SOstrobothnian and the southeastern /ia/ areas. Better still, there’s even the benefit that all three changes can be independently attested in Finnish!

  • Glide epenthesis is a widely spread strategy of hiatus resolution in Finnish dialects, clearly especially in stressed syllables (where typically no further general changes apply); possibly also in unstressed syllables, i.e. in cases of the type *kataɣa >> kataja, SW katava ‘juniper’. These, too, might be at least partly epenthetic glides appended to *kata.a, rather than direct reflexes of *ɣ. (However, *-aða > *-a.a > /-aa/ appears to be exceptionless.)
  • Raising of unstressed /e/ to /i/ before /j/ is well-attested all across Finnish, e.g. in actor nouns from e-stem verbs (sure- ‘to mourn’ → surija ‘mourner’). (No similar change applies with a labial glide, though: sanova ‘saying’ never gives ˣsanuva. Some eastern dialects show instead labial coloring, e.g. tuleva ‘coming’ > tulova; perhaps a more natural effect of the labiodental glide [ʋ].)
  • Even today Finnish really shows no distinction between unstressed [ijV] and [i.V]: contrasts such as nauttia ‘to enjoy’ vs. nauttija ‘enjoyer’ are purely orthographic. Subphonemic variation between [u.V], [y.V] and [uwV], [yɥV] also appears, particularly conspicuous after stressed syllables (e.g. standard tauot ‘pauses’ is usually [tauwot ~ tawːot], not [tau.ot]).
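The proposed three-step pathway (glide epenthesis, raising before the new glide, loss of the now subphonemic glide) can be sketched as an ordered cascade of rewrites. This is a simplified illustration, not a full model: it works on orthographic strings, only looks at a word-final vowel sequence, and ignores stress. The forms korkea and kauhea come from the discussion above, while sanoa ‘to say’ is added here purely as an illustrative infinitive in -A of a stem in -O-.

```python
import re

FINAL_A = r"(?=(?:a|ä)$)"  # lookahead: word-final a or ä follows

# Mid vowel -> (homorganic glide, raised vowel)
SERIES = {"e": ("j", "i"), "o": ("w", "u"), "ö": ("ɥ", "y")}


def glide_epenthesis(word):
    """/ea eä oa öä/ > /eja ejä owa öɥä/."""
    for mid, (glide, _) in SERIES.items():
        word = re.sub(mid + FINAL_A, mid + glide, word)
    return word


def raise_before_glide(word):
    """/eja owa öɥä/ > /ija uwa yɥä/."""
    for mid, (glide, close) in SERIES.items():
        word = re.sub(mid + glide + FINAL_A, close + glide, word)
    return word


def drop_glide(word):
    """/ija uwa yɥä/ > /ia ua yä/ (no contrast between [ijV] and [i.V])."""
    for _, (glide, close) in SERIES.items():
        word = re.sub(f"(?<={close}){glide}" + FINAL_A, "", word)
    return word


def derive(word):
    """Return every stage of the derivation, starting from the input."""
    stages = [word]
    for rule in (glide_epenthesis, raise_before_glide, drop_glide):
        stages.append(rule(stages[-1]))
    return stages


if __name__ == "__main__":
    for form in ("korkea", "kauhea", "sanoa", "pohja"):
        print(" > ".join(derive(form)))
```

Original /-CjV/ as in pohja passes through untouched, since none of the three rules has a mid vowel + vowel sequence to operate on; its separate SOstrobothnian development (pohoja) is outside what this sketch covers.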

This approach would also seem to allow explaining an interesting asymmetry in the small southwestern zone in Uusimaa, which shows only /ea/ >> /ia/ but no /oa/ >> /ua/ (rather /OA/ > /OO/). Here I would note that Finnish definitely has a phoneme /j/ anyway, but no /w ɥ/; maybe this resulted in /eA/ > *ejA but no epenthesis from /oa öä/ to **owa **öɥä. A similar situation extends also to the southwestern dialects proper, which mostly show /ea/ >> /i/ but /oa/ >> /o/. The western Uusimaa dialects are already known for sharing also other features with SW Finnish, and to me it would seem the best to treat the former as an archaic sister group of the latter, not as an SW-influenced group of the Tavastian dialects (which do not form a single historical subgroup anyway). It seems that either *ia *oa or *ia *oo could be reconstructed as the typical pre-apocope reflexes in SW Finnish.

Altogether, one very broad point this case study shows is that while the phonological makeup of the Finnish dialects has been well-documented by now, the actual history leading up to them remains open to analysis.

[1] AGk /oe/, when not simply retained, gives however rather ModGk /oi/, or more exactly, the diphthong /oi̯/ = /oj/.
[2] An intermediate voiced stage for “*q” also explains why Proto-Arabic *g is spontaneously fronted to something like /ɟ/ or /dʒ/ in most varieties.
[3] I.e. bisyllabic [e.e], [o.o]; not to be confused with the AGk long vowels η ω /eː oː/ which I believe give short /i o/ in ModGk universally.
[4] I could quibble a bit with this last argument though. Certainly closing diphthongs such as /ai/, /au/ are ubiquitous, but it is not clear to me if close-to-open diphthongs like /i͡a/, /u͡a/ are actually substantially more common than mid-to-open diphthongs like /e͡a/, /o͡a/. But also variation between the two is common, and in all cases known to me, mid-to-open is moreover more archaic than close-to-open (thus e.g. Eastern Finnic, Western Mansi, Northern Samoyedic, several Samic varieties). This diachronic universal will be at least as good for the purposes of his argument, if not better, than JMD’s alleged synchronic universal.
[5] Verb stems in -e- are for some reason not covered by Kettunen’s dialectal atlas, perhaps since quite a few of them have instead consonant-stem infinitives, showing either assimilation of earlier *ð (pure- : purra ‘to bite’, tule- : tulla ‘to come’, mene- : mennä ‘to go’), late retention of *ð (näke- : nähdä ‘to see’), or blocking of lenition from *t to *ð to begin with (pese- : pestä ‘to wash’).
[6] The majority development, including modern colloquial Finnish and also most other Finnic varieties where deletion of medial *ð applies, is to instead contract these to long mid vowels /ee OO/, possibly followed by other changes such as diphthongization to /ie UO/ (thus e.g. Karelian proper) or shortening to /e o/ (thus e.g. Estonian).

Followup anti-etymology: ? *täCə ‘birch bark covering’

In the last post I parenthetically mentioned a PU root “*täsə (UEW: *tisɜ)” ‘birch bark covering for a teepee’. This has been previously reconstructed from very scanty evidence: Komi /tis(k)a/, Forest Nenets /tʲēt/ ([tɕi͡et]), Kamassian [tʰɤʔ]. The latter two point to a Proto-Samoyedic form *t¹ät¹, which per the Komi comparison would have to be equal to plain *tät (*t¹ stands for *t or *č, which cannot be distinguished without Selkup). Samoyedic sometimes seems to have irregular *ä for PU *i (e.g. *mäńä ‘daughter-in-law’), but I think this word does not need to be one of them: this can be also the inverse, with Komi /i/ secondarily from *ä, a development attested also in e.g. /ki/ ‘hand’ < *kätə.

UEW makes the same mistake, I think, in one other case too: Permic *li ‘sap, phloem’ ~ Kamassian [lēji] ‘sap’ has been reconstructed as PU *lijɜ, where *läjə or *läŋə would seem better to me (but the unexpected retention of *l- in Samoyedic and the unexplained (suffixal?) final vowel leave me suspicious about whether this comparison, too, is correct at all).

I realize today, however, that the consonantism of my alleged *täsə requires more thought. This reconstruction as such should give voiced **-z- in Komi, not voiceless /s/! The variant with /sk/ is still a good hint that the word probably comes about thru some degree of suffixation. I can think of at least three options, none of them entirely unproblematic:

  1. a PU root *tätə, continued directly in Samoyedic but suffixed to *tätə-ksə > *ti-s(k)-a in Permic, with regular loss of medial *-t-;
  2. a PU root *täsə, continued directly in Samoyedic but suffixed to *täs-kä in Permic;
    • but from early *ä-ä I would rather expect **ɤ or **e in Komi;
  3. a PU form *täkə-ksə / *täxə-ksə / *täwə-ksə, with the 2nd syllable regularly lost in both branches and the nominal suffix *-ksə reduced to *-t in Proto-Samoyedic;
    • but I would expect *-tə, as also found e.g. in *suksə > *tutə ‘ski’ or Jussi Ylikoski’s recent comparison of northern Samoyedic predestinative *-tə with the Finnic translative *-ksi.

If any further cognates were found elsewhere in Uralic, they should be able to help clarify the situation. Quick checkups of Mordwinisches Wörterbuch and Yhteissaamelainen sanasto and mentally going over the Finnish lexicon have all come up negative, at least. Common Ugric “*täŋɜ-tɜ” ‘quiver’ has some vague resemblance (birch bark is a reasonable material for quivers) but probably not enough. It’s also one of the cases with Ob-Ugric *ɣ ~ Hungarian g, which I think is a point against native Uralic origin, ditto **-tɜ which is not a known nominal suffix in Uralic.

Looking outside of Uralic will be a worthwhile check too. I am firstly reminded of Indo-European *(s)teg- ‘cover, roof’ (> German Dach, Greek (σ)τέγος, etc.), which would be a fair match for my third reconstruction as *tä{k|x}ə(-ksə). Routing a loanword into Samoyedic would require a reflex in Indo-Iranian or Tocharian though, and going by standard references neither of them seems to have any kind of a basic noun reflex of this root. The Uralic support is also much too shaky for me to consider any kind of ancient Indo-Uralic cognate status, in case this doesn’t go without saying. So no progress here either.

A better lead seems to be found towards the east. A quick lookover of Turkic has proven similarly unproductive; but in Tungusic we finally find *tüksa ‘birch bark covering for a house’, an exact semantic match with fairly close-by shape. The Komi word could be actually interpreted as a relatively recent loanword from the Evenki reflex /tiksa/. The sound substitution to /sk/ would be curious, as if recapitulating the Proto-Permic metathesis of inherited *ks, but this is really not any worse of a problem than the issues in the comparison with Samoyedic. Morphologically then this comparison indeed looks better! While Komi /-a/ is a known derivational suffix, it productively forms only adjectives. Bisyllabic nouns ending in /a/ are often instead loans, e.g. /ćarla/ ‘sickle’, /koba/ ‘spinning wheel’, from Turkic; /kaľja/ ‘type of beer’, /ľuśka/ ‘spoon’, from Finnic. — Komi and Evenki are not known as close neighbors, but both have been notable trade languages in western Siberia before the expansion of Russian, and a few other Tungusic loanwords in Komi have been already proposed as well.

It still would be good to have additional evidence for *ks → /s/ or /sk/ in loanwords into Komi however. The cluster /ks/ is not categorically shunned, and it can be found e.g. in /ɤksɨ/ ‘prince’ (probably ← Alanic, cf. Ossetic /ɐχsin/ ‘lady, princess’, though some details of transmission remain unclear).

I have also not managed to scrounge up any other etymology for the Samoyedic words. Regardless, going by the Komi ← Evenki loan hypothesis, I now lean towards not reconstructing this word for Proto-Uralic after all.

Probably not a valid etymology: *čäččä ‘birch bark’

The Proto-Finnic word for ‘birch bark’ was *toohi (consonant stem: *toohë-, partitive *tooh-ta), continued directly in Finnish and Karelian tuohi, Veps toh’. The southern Finnic languages mainly show derivatives: Votic toho, standard Estonian toht(u-), Võro tohk(o-), Livonian tū’oigõz (however EES reports a seemingly underived form tooh from somewhere else in South Estonian).

The usual etymology, known for close to 150 years by now, has been to connect this with Latvian tāss, Lithuanian tošis of the same meaning. We could indeed derive *toohi from earlier *taaši or *taašə (your call on the age of the shift of final *-ə to *-i), which will be immediately and easily comparable with an East Baltic *tāšis. Already given the abundance of Baltic loanwords in Proto-Finnic, versus the rarity of Finnic loans reaching Lithuanian, generally the assumption has been that this word, too, comes originally from the Baltic side.

There does not seem to be an immediate Indo-European or even Balto-Slavic etymology though. Basic etymological references suggest derivation from √teš- < PIE *tetḱ- ‘to cut’, but this at first looks like only a semantically vague root etymology. There may well be evidence to further support it in Baltistic literature… but how far would exploring an origin on the Uralic side go?

Looking only at Finnic, *toohi actually has a native enough look, paralleling nouns like *sooli : *soolë- *sool-ta ‘gut’ (from pre-PF *śaali < PU *śalə). In a wider Uralic context just one feature is unexpected: the long vowel preceding *h < *š. As per current understanding, the long nonclose vowels *oo, *ee in Finnic first arise by what I call Lehtinen’s Law: the lengthening of *a, *ä before sonorants in *ə-stems, while before obstruents they remain short. Retention is clearly supported before *k (*käki : *käke- ‘cuckoo’, *mäki : *mäke- ‘hill’, *näke- ‘to see’, *väki : *väke- ‘power’) and *s (*asë- ‘to be located’, *kasi : *kasë- ‘dew’). For *p, *t, *h there is only one example each (*käci : *käte- ‘hand’, *lähe- ‘near, close’, *läpi : *läpe- ‘hole, puncture’), but still no clear counterexamples. I’ve proposed that the stem type *CAATi > *CEETi in Finnic (likewise *CAACA > *CEECA) originates precisely thru IE loanwords, including *toohi < *taaši.
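To make the conditioning concrete, here is a minimal Python sketch of Lehtinen's Law as described above, restricted to stems of the shape (C)VCə. The function name `lehtinen`, the sonorant inventory and the restriction to a single medial consonant are simplifying assumptions of mine for illustration only, and the later shift of final *-ə to *-i is not modeled.

```python
# Toy model of Lehtinen's Law: *a, *ä lengthen before a single sonorant
# in *ə-stems, while before obstruents they remain short.
# Assumption: this sonorant set is a rough stand-in, not a vetted inventory.
LOW_VOWELS = {"a", "ä"}
SONORANTS = {"m", "n", "ń", "ŋ", "l", "r", "j", "w"}


def lehtinen(stem: str) -> str:
    """Lengthen the root vowel of a (C)VCə stem if the medial is a sonorant."""
    if len(stem) < 3 or not stem.endswith("ə"):
        return stem  # not an *ə-stem of the modeled shape
    vowel, consonant = stem[-3], stem[-2]
    if vowel in LOW_VOWELS and consonant in SONORANTS:
        return stem[:-3] + vowel * 2 + consonant + "ə"
    return stem  # obstruent medial (or other vowel): no lengthening


for proto in ["śalə", "käkə", "kätə", "tašə"]:
    print(proto, ">", lehtinen(proto))
```

Under these assumptions a hypothetical native *tašə comes out unchanged, since *š is not a sonorant; this is the gap that a native origin for *taaši would have to bridge.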

And there remains a little bit of room for doubt. Interestingly the stem *lähe- does not go back to older *läšə-: instead it appears to be a case of *s > *h, per the evidence of forms like Fi. läsnä ‘near, present’, an archaic locative in *-nA that also finds an exact cognate in Mari *lĭSnə > Hill /lišnə/, Meadow /ləšne/ ‘near’. [1] So in principle there is an opening: it could be proposed that, for whatever reason, pre-Finnic *š also triggers LL.

Actual candidates for Uralic cognates would be still needed to get anywhere further with this speculation. Some pre-Neogrammarian sources (Castrén, Donner) have compared tuohi with forms for ‘birch bark’ like Udmurt /tuj/ (< Proto-Permic *toj), Tundra Nenets /ta͡e/ (< Proto-Samoyedic *təj), but a sound correspondence *h ~ *j is by modern standards clearly untenable and the vowels don’t play nice either. [2]

Something slightly better can be however found in Mansi, where ‘birch bark’ is *šääšə (South /šääš/, Central /šöäš/ ~ /söäs/, North /saas/). This has been compared with Khanty *siińć (Eastern) ~ *seeńć (Western) ‘id.’, but this probably cannot be correct due to the medial consonant mismatch. By comparison with Finnic though we could trace the Mansi word back to a PU form *čäččä instead. Everything other than the Finnic long vowel would regularly follow known sound laws: *ä-ä > *ää-ə, degemination and *č > *š in Mansi; *ä-ä > *a-ə and *čč > *h in (pre-)Finnic. For initial *č-, traditionally the Finnic reflex has been assumed to be *č- > *š- > *h-, but the evidence is not strong, [3] and *č- > *t-, parallel to clearly regular *-č- > *-t-, has been also proposed in recent times. So have we now managed to uncover the Proto-Uralic term for ‘birch bark’?
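The derivations claimed above can be replayed mechanically. The following Python sketch applies the changes named in the previous paragraph as ordered rewrite rules to the hypothetical *čäččä. The rule lists and the helper `derive` are my own shorthand for those statements (with *č- > *t- chosen for the Finnic initial), not a full model of Mansi or Finnic historical phonology.

```python
import re

# Ordered rewrite rules as (regex pattern, replacement) pairs.
# Assumption: each change is reduced to a single string substitution.
MANSI = [
    (r"ä(.+)ä$", r"ää\1ə"),  # *ä-ä > *ää-ə
    (r"čč", "č"),            # degemination
    (r"č", "š"),             # *č > *š
]
FINNIC = [
    (r"ä(.+)ä$", r"a\1ə"),   # *ä-ä > *a-ə
    (r"čč", "h"),            # *čč > *h
    (r"^č", "t"),            # initial *č- > *t- (the more recent proposal)
]


def derive(form: str, rules) -> str:
    """Apply each rule once, in order, to the whole form."""
    for pattern, replacement in rules:
        form = re.sub(pattern, replacement, form)
    return form


print("Mansi: ", derive("čäččä", MANSI))
print("Finnic:", derive("čäččä", FINNIC))
```

The Mansi chain yields *šääšə as reconstructed, while the Finnic chain stops at a short-vowelled *tahə, leaving the long vowel of *toohi unaccounted for.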

While this new etymology could formally reach at least a level of “nonprovably regular”, I still think it is not likely to be correct. There are at least four red flags…

– The first is of course the fact that we have also the option of a competing etymology from Indo-European on the table, even if this lacks the benefit of being semantically exact all the way down.

– The second I’ve already pointed at too: the hypothesis that *-Ašə > *-AAšə in pre-Finnic is not a good fit for the known historical phonological framework of Finnic. I do not expect any additional supporting future evidence to be findable either, as the only other Finnic stem of the shape *CEEhi is *voohi < *aaši ‘goat’, an obvious Baltic loanword, this time with good IE provenance (Lt ožys, Lv āzis < *āžis << PIE *h₂aǵ-). I still think it is legitimate to sometimes propose “nonprovably regular” sound changes, but this is on the condition that they should make sense phonologically.

– Third, there would be issues with relative chronology. While PU *čč ultimately gives *h in Finnic, it is not clear if there ever was a stage with *š that even could feed into Lehtinen’s Law. Kallio proposes for this instead a route *čč > *tš > *šš > *hh > *h that makes rather more sense to me. Finnic tolerates geminate *ss just fine, even if its etymological sources are scarce (in Proto-Finnic basically limited to the inessive ending *-ssA < *-snA and some onomatopoeia), and earlier on there probably would not have been any problem with a transient *šš either; while geminate *hh would be typologically a bit more unusual. (Anything like starting from *čäšä and proposing an ad hoc assimilation to *čäčä in pre-Mansi would not be progress either.)

– Last, a lexicological point. While Proto-Uralians obviously must have known the material, the big picture is that ‘birch bark’ is etymologically highly unstable across Uralic: basically every branch has its own term with no clear, unambiguous cognates. Picking any one as the actual primary PU term would be guesswork, and it is not impossible that there even was no single PU word for ‘birch bark’ at all, only some compound or analytical expression along the lines of *kojwan_karə ‘bark of birch’. For that matter, other more or less specialized ‘bark’ terms have wide variety too. This should not be a huge surprise, as terms for the natural environment are typical substrate vocabulary. The Baltic etymology for the Finnic word fits well enough in this pattern (it has indeed been remarked long since that terms for the natural environment are common among Baltic loans in Finnic). We can also at least hypothesize that similarly the discrepancy between Mansi *šääšə ~ Khanty *sii/eeńć could be due to the words coming from two related but different substrate languages of western Siberia; say, pre-Mansi #sɛčV ~ pre-Khanty #senčV. If everything else was in order, a binary comparison could be acceptable, but a comparison drawn from a large pool of candidates that still remains messy is evidence for the similarity of *taaši and *čääčə being accidental.

(Amusingly enough, several words for finished birch bark products have better odds of being reconstructible for PU; e.g. *d₂äŋäs ‘small box made of birch bark’, *küčä ‘drinking vessel made of birch bark’, *täsə (UEW: *tisɜ) ‘birch bark covering for a teepee’.)

So altogether my novel Finnic–Mansi comparison ends up providing more heat than light; it is not a hill I want to die on or even really risk getting injured on. Hopefully still worth putting out there as a humble blog post though. It might be a good illustration of the repeating dead ends and close calls that come up in daily etymological research, but which will be generally left invisible in published works. (And who knows! while I won’t be holding my breath, it’s always a possibility that someone will eventually discover some other way still to bridge the problems in the idea.)

[1] This is though the only example that would retain a trace of the old consonant gradation pattern *s : *h root-medially. Elsewhere we find this alternation only in suffixal gradation (e.g. nominals of the type *taivas : *taivahë- ‘heaven’, *kapris : *kaprihë- ‘deer’) while root-medial *s does not show any productive qualitative gradation (*pesä : *pesä-n ‘nest’, *asë-tta- ‘to place’, and not **pehän, **ahëtta-). This could rouse suspicion for exploring other directions of explanation, such that perhaps läsnä is actually in origin something like a haplological inessive *läšə-snä > *lä-snä, ditto Fi. dialectal lästä ‘(from) near’ a haplological elative *läšə-stä > *lä-stä rather than an archaic ablative *läs-tä. Analogy with *tä-snä, *tä-stä ‘(from) here’ could be an option too. But this too would be an explanation relying on idiosyncratic, essentially ad hoc analogies that cannot be decisive.
[2] I’m not even convinced that the Permic and Samoyedic words should be thought of as cognate with each other. UEW proposes *tojɜ, but neither *o > *o in Permic, *o > *ə in Smy nor retention of *-j would be regular.
[3] More exactly, the evidence for actually reconstructing *č- and not *š- to begin with in the proposed examples is not strong. E.g. there do not seem to be any examples where Finnic *h- would correspond to a clear affricate *c- in Samic, unless we start counting otherwise poor comparisons like PF *hanki (< *čaŋkə?) ‘snow cover’ ~ PS *cōŋoi (< *čaŋoi?) ‘id.’ — Even the general development of initial *č- in Uralic seems to me like it still requires further study; for example Mari too shows some evidence for deaffrication *č- > *š- vs. also some for retention as *č-.

Phonology squib: Conditional *h-loss in Estonian

The history of Proto-Finnic *h provides several illustrative examples of the diachronic development of “laryngeal” consonants. The primary overarching pattern is a north(east)–south(west) cline of gradual loss. This demonstrates that *h-loss processes have arisen independently in multiple lineages, and in multiple layers in many of them:

  • Karelian, Ludian–Veps: generally retained in all positions
  • Northernmost dialects of Finnish: retained but with several metathesis rules
  • South Ostrobothnian Finnish: retained in most positions, but word-final *-eh has been analogically generalized to -es
  • Ingrian, most remaining dialects of Finnish, South Estonian: generally retained in the initial (stressed) syllable and following it, lost after unstressed syllables; additionally word-final retention in some SE
  • North Estonian: retained in the initial syllable and in *CVhV, lost after unstressed syllables and in *CVRhV
  • Votic: retained in *CVhV and *CVhCV, otherwise lost
  • Livonian: *h > [ʔ] (“broken tone”, “stød”) in *CVhV and *CVhCV, otherwise lost

Further detail exists still. One such case is standard North Estonian, where we find word-initial loss in several words. The traditional explanation attributes this to dialect borrowing. There are indeed North Estonian dialects showing complete loss of word-initial *h, so there’s nothing impossible about this. Dialect borrowing would be moreover partially paralleled by the example of word-initial /h/ in Votic, originating in Finnish and Ingrian loans (obvious also by other markers in some cases).

It however seems to me that in Estonian a clear sociolinguistic motivation for dozens of *h-less loanwords from folk language into the literary prestige standard is lacking. We can contrast this with the early development of the Finnish literary standard; despite having Turku as its initial seat of development, standard Finnish has generally shunned specifically Southwestern dialectalisms. Instead the effect has been to “dilute” the dialect of Turku towards standard Finnish and away from the rural SW dialects. [1] Also pronunciation respelling seems unlikely to be the main mechanism; people usually tolerate entirely silent letters quite well (there does not seem to be major pressure to respell kn- in English, h- in French or Spanish, lj- in Swedish, etc.)

I see instead evidence for a further conditional sound change: *h- is lost preceding another *-h- + a voiced segment, i.e. in *hVhRV, *hVhV.

The *hVhRV case has a good half a dozen examples and no counterexamples that I can find:

  • ahel ‘chain’ < *ahl < *hahla (cf. Fi. haahla; ← Germanic)
  • ehmes ‘fluff, down’ < *hehmes (← Baltic *šeusm-)
  • ihn ~ hihn ‘strap’ < *hihna (cf. Fi. hihna; ← Baltic)
  • ihne ‘stingy’ < *hihneh (← Baltic *šikšn-)
  • uhmer ‘mortar’ < *huhmari (cf. Fi. huhmar)
  • ühm ~ hühm ‘slush’ < *hühmä (cf. Fi. hyhmä)
  • õhv ‘heifer’ < *hëhvo (cf. Fi. hieho)

For the *hVhV case there is only one really obvious case + another that shows secondarily inserted medial *h:

  • iha ‘sleeve’ < *hiha (cf. Fi. hiha)
  • ihuma ‘to whet’ < *hiho- < *hi.o- < *hijo- (cf. Fi. hioa, dial. hijoa)

but I suspect that also some cases of *h-loss from *hVhTC belong here, which may have lost their *h in the weak grade:

  • uht : gen. uha ‘swidden’ < ? *huht : *uha < *hukta (cf. Fi. huhta)
  • uhtma : 1PS uhan ‘to rinse’ < ? *huhtma : *uhan < *huhta- (cf. Fi. huuhtoa)
  • õhk : gen. õhu ‘air’ < ? *hõhk : *õhu (cf. Fi. hehkua ‘to radiate, emanate’; hohkua ‘id.’)
  • õhkama : II inf. õhata ‘to sigh, emanate’ < ? *hõhka- : *õha- (cf. the previous)

A general loss of *h also in *hVhTV cannot be the case, per hahk ‘gray; eider’. Potentially the vowel difference could matter, but I would not assume this by just one example.

Phonetically, dissimilation of h…h would be natural. But why should the identity of the following segment matter? I think that allophony of /h/ is involved: at least in Finnish there is variation between voiceless [h] (word-initially and before voiceless consonants) ~ at least partly voiced [ɦ] (between voiced segments). If this is or has been the case in Estonian, too, then we could assume that *[hVɦ] first assimilates to *[ɦVɦ], followed by *[ɦ] > ∅ word-initially.
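The proposed chain can be written out as a small rule cascade. Below is a minimal Python sketch, applied to pre-Estonian input forms taken from the lists above; the segment classes (`VOWELS`, `VOICED`) and the function name `h_loss` are assumptions of mine for illustration, and the voiced allophone [ɦ] in Estonian is itself a hypothesis here, as stated in the text.

```python
import re

# Assumption: rough segment classes, sufficient for the examples in this post.
VOWELS = "aeiouäöüõë"
VOICED = VOWELS + "mnlrvj"  # vowels plus voiced consonants that follow *h here


def h_loss(form: str) -> str:
    """Model *h- loss before another *h that stands between voiced segments."""
    # 1. Allophony: /h/ is realized as voiced [ɦ] between voiced segments.
    phon = re.sub(rf"(?<=[{VOICED}])h(?=[{VOICED}])", "ɦ", form)
    # 2. Assimilation: initial [h] > [ɦ] if the next consonant is [ɦ].
    phon = re.sub(rf"^h(?=[{VOWELS}]+ɦ)", "ɦ", phon)
    # 3. Word-initial [ɦ] is lost.
    phon = re.sub(r"^ɦ", "", phon)
    # Back to phonemic notation: remaining [ɦ] is still /h/.
    return phon.replace("ɦ", "h")


for proto in ["hahla", "hihna", "huhmari", "hiha", "huhta", "huha", "hahka"]:
    print(proto, ">", h_loss(proto))
```

In this sketch *huhta keeps its initial *h in the strong grade (the second *h stands before voiceless *t) while a weak-grade *huha loses it, matching the suggested origin of uht : uha; *hahka is likewise untouched, in line with hahk.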

The Central Finnic (North Estonian & Votic) innovation *Rh > *R, where *R ∈ {n, l, r}, could be also naturally routed thru an *[Rɦ] stage. This is not strictly necessary though, since there is no contrasting **th or the like.

— There is slight evidence also for another even more minor *h-loss sound law: juus ‘hair’ < *hibus and juuk ‘fine’ < *hiukka seem to involve *hiu- > *hjuu- or *hʲuu- > juu-.

I do not know offhand if any traditional Estonian folk dialects would follow exactly the *h-loss patterns I’ve identified here. Still, even just acrolectal standard (North) Estonian probably could have gone thru some sound changes all of its own, early on in its development.

[1] Partly this is also because major cities will draw in population from wider out than just from their immediate environments, as demonstrated in the Finnish case by the so-called “Tavastian wedge” dialects that have arisen along the old Turku–Hämeenlinna road, in the parishes of Kaarina, Lieto, Marttila, Kaski etc. in the central parts of Finland Proper.

Nonregularity in North Caucasian

Due to a recent ZBB discussion I ended up re-reading Sergei Starostin’s A North Caucasian Etymological Dictionary Preface. This is one of the more worrisome cases of “Moscow School” phonological tarpits: there is no doubt about Northeast Caucasian being a valid family, and I would also think the relationship with Northwest Caucasian is sufficiently established… but the reconstruction the late Starostin advances for the family sure looks like it has too many bells and whistles, with features like six laryngeals that end up almost randomly reshuffled in the descendants, nearly all obstruents having a plain/geminate distinction orthogonal to phonation, or abundant *Cw clusters at all POAs other than labial. I count 132 basic sound correspondences plus some fifty-odd cluster correspondences. Even spread across two root consonant positions in 2300+ reconstructions, in a reconstruction scheme of this kind there are bound to be reflexes that aren’t actually well enough established.

Probably most fixes to this reconstruction would also have to be etymological. Likely there are correspondences representing areal loanwords rather than original inheritance, or correspondences used to stitch together unrelated vocabulary. Just checking for not-really-regular correspondences would be a good start though.

I’ve picked for a quick case study *pC clusters. These appear word-initially, supposedly evolving from certain *Cw clusters, in two far ends of the family: Nakh and Khinalug. The asserted sources are as follows:

  • *ff > N *pχ, Kh. /px/
  • *ćw > N *ps (Kh. /cʼ/)
  • *św > N *ps (Kh. /s(w)/)
  • *śśw > N *ps, Kh. /pš/
  • *cw > Kh. /ps/ (N *c)
  • *čw > Kh. /pš/ (N *č)
  • *xxw > N *pχ
  • *qw > N *pħ (Kh. /q/)
  • *qqʼw, *ɢɢw > N *pħ (Kh. /qʼ/)
  • *χχw > N *pħ, Kh. /pχ/

Also a cluster *bʡ in Nakh has three origins asserted: *qʼw, *ʡw and *hw.

How many of these developments are actually regular once we look into it? Put in your bets now…

(1) Nakh *ps is found in five examples. Every single one of them has a different reconstruction! i.e. none of them can be considered regular. Besides the three expected cases of *ćw, *św, *śśw, there’s one of *cc’w (alleged regular Nakh reflex *t-) and one of *ćʼ with no labialization even (alleged regular Nakh reflex *cʼ). Tsk tsk tsk. For that matter, two cases have NWC cognates with a presyllable *pə-, supposedly a prefix. My bet would be that this is what really occurs in the Nakh examples too.

(2) A Nakh *pš turns out to exist in one example with *čʼw, whose regular Nakh reflex is allegedly plain *š-. (Maybe another likely prefix case?)

(3) Nakh *pχ is found in four examples; just one of *ff, so irregular in any case. There are no more than two initial and four medial instances of *ff reconstructed altogether. The other case of initial *ff- actually has a Nakh reflex too, but showing *ħ-! — The three cases of *xxw do not look that much better. NWC has *xw in two cases (and also for the *ff case), secondary *x́w in one, so this at least seems to work. Lak has one case of /xx/, one case of /xxw/ and one case of /šš/; the last supposedly by late palatalization from *xx … but, unfortunately, the one example of /xx/ occurs before /i/? Andic has one case of *xw, one case of *ɬw.

(4) Nakh *pħ rakes together a seemingly respectable 13 examples. But they diverge to nine reconstructions, of which most occur just once: *q *qw *qq *qqw *qʼw *χχw *pʼɦ. The last is a cluster type (obstruent + laryngeal) that seems to be relatively common in the proto-lexicon but is strangely not at all commented on in the Preface. As for the others, only the *qw and *χχw cases seem even expected. For the others the allegedly regular Nakh reflexes are *q > *q, *qq > *q/*ʁ,  *qqw > *q/*ʁ, *qʼw > *bʢ. (There is one appeal to labiality metathesis: *qarćʼwV > *qwarćʼV before *qw > *pħ? But this is itself clearly ad hoc rather than regular.)

Our last hope for Nakh *pC are thus the clusters *ɢɢw, *qqʼw; the first represented by four examples (one of them with also a laryngeal: *ɢɢHw), the second by two examples (one of them with a laryngeal). Starting with *qqʼw, and skipping over subfamilies reflecting only one instance, in Tsezic we have one case of *qʼw and one of *qʼ; in Lezgic, one case of *qʼˤw and one of *qʼw (respectively). Inconsistent secondary articulations are not the most major problem maybe, but then the latter etymology additionally requires metathesis from *tʼHalqqʼwV to *qqʼHwaltʼV in Nakh. — Moving to *ɢɢw (when’s the last time you heard of a language that has geminate voiced uvular stops, incidentally?): Tsezic has one *q, one *qw; Dargwa has one *ʁˤw, one *ʁˤ and one *qqw; Lezgic has one *qqˤ, one *qqʼˤ, one *qqʼˤw. One case has a presyllable *mu-, and it would be possible to speculate that actually this is the real source of the Nakh cluster.

(5) Nakh *bʡ is found in an also respectable eleven examples (plus one word-initial one). Three of them are from *ʡw, which ends up reflected reasonably regularly: the reflexes also include two cases of Andic *ħ and one of *ħw, two cases of Tsezic *ħ, three cases of Lak zero, two cases of Dargwa *ħ, two cases of Lezgic *ʔw. A small ray of hope, maybe…

Four cases from *qʼw (three of them with also a laryngeal: *qʼHw) look promising too. But the distribution of these etyma is terrible: only Lak and NWC also reflect more than one of them. The former has one case of *w, one case of *qʼ; the latter has in both cases *qʼ, though the second one with a presyllable *p-, again casting doubt on analyzing Nakh *b as continuing *w.

In the waste pile of protoforms attested only once, we have *ʔw, *hw, *ɦw, *bɦ (with the *hw case showing a presyllable *ba- in NWC).

(6) A Nakh *bʕ appears too. One supposedly from PNC *wH, another two from PNC *bʕ (of which one case “with some metatheses and aberrations”). The latter two do have *pp in Lezgian.

(7) Khinalug /ps/ is found in two examples, one of them indeed *cw and the other *čw. For *cwaʡmV ‘bear’, NWC adds (is supposed to metathesize) a presyllable *mə-; maybe this is once again what’s really going on.

(8) Khinalug /pš/ is found in four examples, going back to *śśw twice, *čw once and also *chw once (I think that’s an alveolar affricate + laryngeal sequence?). Lak has /š/ in both cases of *śśw; NWC has a presyllable *pə- in one of them.

(9) Khinalug /px/ is attested just once; enough said.

(10) Khinalug /pχ/ is attested once word-initially from *χχw as promised, also once word-medially from a sequence *-waχχ-.

So the basic toll is: the Nakh *pC clusters regularly correspond to nothing whatsoever across Northeast Caucasian. Only three of the eight alleged regular sources are actually regular even from PNC to Nakh (“soundlawful regularity”, one of the weakest types). For *bʕ we can find a weak two-example correspondence with Lezgian *pp, for *bʡ one just barely more substantially regular set of correspondences. Khinalug /pš/ finds one two-example correspondence with Lak /š/.
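The kind of tally used throughout this survey is easy to mechanize. As a minimal Python sketch, here are the figures quoted under point (4) for Nakh *pħ, with a filter for reconstructions that recur at all. The two-example threshold and the names `NAKH_PH_SOURCES` and `recurring` are my own assumptions for illustration; the counts are simply the ones given in the text, not re-checked against the dictionary.

```python
from collections import Counter

# Number of etymologies per reconstructed source of Nakh *pħ,
# as reported in point (4) above: 13 examples over nine reconstructions.
NAKH_PH_SOURCES = Counter({
    "*q": 1, "*qw": 1, "*qq": 1, "*qqw": 1, "*qʼw": 1,
    "*χχw": 1, "*pʼɦ": 1, "*ɢɢw": 4, "*qqʼw": 2,
})


def recurring(sources: Counter, minimum: int = 2) -> dict:
    """Keep only sources attested at least `minimum` times.

    Assumption: two examples is the bare minimum for calling a
    correspondence recurrent; it says nothing yet about regularity.
    """
    return {proto: n for proto, n in sources.items() if n >= minimum}


total = sum(NAKH_PH_SOURCES.values())
print(f"{total} examples, {len(NAKH_PH_SOURCES)} reconstructions")
print("recurring:", recurring(NAKH_PH_SOURCES))
```

Only *ɢɢw and *qqʼw survive even this weak filter, and as discussed above their reflexes elsewhere in the family are not consistent either.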

This survey does not fill me with hope for either the current proposals being correct or for the ability to find new, stronger phonological solutions with future work. Probably this is bound to happen to some extent in comparative work between languages with highly complex phonologies. I however wonder now just how much else this result applies to.

