Phonological Cores and Average Regularities

Some thinking out loud on the formalization of comparative and historical phonology.

As in most work I’ve seen on the topic, I presume that an etymological corpus of word comparisons has already been given, and also aligned segmentwise. [1] The usual question at this point is how to proceed with reconstruction. I however largely assume even this as given. The main questions I would ask are: how much should we trust a reconstruction given for the data? How coherent is it internally to begin with, and how does it match against other reconstruction possibilities?

This is not a very relevant question for developing automatic reconstruction methods, [2] but a better understanding of these issues will be practical in assessing existing proposals. Especially the ones that cover substantial amounts of data but are regardless disputed on every front, e.g. any variant of Altaic or Nostratic.

Foundations and Cores

The basic concepts of this post:

  1. A phonological foundation is a set of word comparisons where every sound correspondence is regular within the set.
  2. A phonological core is a minimal phonological foundation, i.e. a phonological foundation such that no strict subset of its word comparisons is a phonological foundation anymore.

Note that these definitions are only with respect to etyma, not with respect to the number of reflexes. A comparison of two-reflex etyma could be exactly as regular as a comparison of ten-reflex ones; a compact foundation comparing only two languages could be exactly as regular as a diffuse foundation comparing ten languages. For now, all correspondences still have to be regular between all applicable language pairs specifically. [3]

These concepts have been phrased purely in terms of sound correspondences. Actual reconstruction requires consideration as well, though. A few initial definitions for this:

  • A reconstruction is a set of word comparisons between at least three languages, with exactly one of them being a special type of language called a proto-language, with the following properties:
    1. Every comparison includes a proto-language cognate (called a proto-form).
    2. The proto-language is not given by external data, but can be adjusted at will.
      (I.e. this is the “operational” proto-language, not the inferred “real historical” proto-language. By this definition, Latin is not Proto-Romance, at most identical to Proto-Romance.)
  • A historical phonology is a partially ordered set of sound changes (I will not go here into rigorously defining a sound change) with the following properties:
    1. Sound changes are ordered with respect to one another only if they interact ≈ roughly: take the same segment as input or as conditioning.
      (I.e. we abstract away the difference between historical phonologies that differ only in the relative chronology of changes that do not interact. In the absence of other details, *śëta > *śata > sata and *śëta > *sëta > sata should be considered identical histories.)
    2. The bottommost sound changes start from the proto-language.
    3. The topmost sound changes yield the other languages as recorded in real data.
    4. For any sound change applying to all languages, there is at least one sound change postdating it that does not apply to all languages.
      (I.e. the proto-language is still indeed the last common ancestor, not merely any common ancestor.)

The latter could be better called a “comparative historical phonology”… since real historical phonologies often take additionally also loanword evidence into account when establishing relative chronologies. And we could also define internally reconstructed historical phonologies that replace condition 4 with a redefinition of the proto-language. Gotta learn to walk before running, though.

I have on purpose defined these two concepts without referencing the concepts in my first list. It is more profitable to instead treat these as orthogonal, and to speak of concepts such as a foundational reconstruction = a reconstruction whose underlying set of word comparisons, the proto-language excluded, is a phonological foundation. There are many proposed reconstructions, and we do not want to suggest that they are by definition regular in my highly formal sense! As seen below, perhaps they do not even need to be.


Some concepts in hand, let us now go over a simple example. One clean phonological core within Uralic is presented by the following four etymologies between Finnish, Northern Sami and Erzya:

  • Fi. kesä ~ NS geassi ~ Er. кизэ /kize/ ‘summer’
  • Fi. pesä ~ NS beassi ~ Er. пизэ /pize/ ‘nest’
  • Fi. kala ~ NS guolli ~ Er. кал /kal/ ‘fish’
  • Fi. pala ~ NS buolli ~ Er. пал /pal/ ‘bit’

(I have stuck here to the most widely spoken members of their subfamilies. The data could be easily also stretched to include further varieties, or rewritten as a comparison of Proto-Finnic, Proto-Samic and Proto-Mordvinic.)

We can easily see that everything is regular: every sound correspondence occurs at least twice — in fact exactly twice; cores with correspondences occurring thrice could only be put together from more holey data. Reconstructions would be easy to suggest too. A phonetically simple approach would be e.g. *kesä, *pesä, *kală and *pală, which is only mildly off from the usual thinking. [4]
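
The check just described can be sketched in a few lines of Python. This is only an illustration under assumptions of my own, not a definitive implementation: the slot-by-slot alignment (four slots per word, with ∅ standing in for the missing Erzya final vowel), the function names, and the choice to call a correspondence regular once it is found in at least two comparisons are all supplied for the example.

```python
from collections import Counter
from itertools import combinations

LANGS = ("Fi", "NS", "Er")

# Assumed alignment: one tuple of segments per language, slot by slot.
CORE = {
    "summer": {"Fi": ("k", "e", "s", "ä"), "NS": ("g", "ea", "ss", "i"), "Er": ("k", "i", "z", "e")},
    "nest":   {"Fi": ("p", "e", "s", "ä"), "NS": ("b", "ea", "ss", "i"), "Er": ("p", "i", "z", "e")},
    "fish":   {"Fi": ("k", "a", "l", "a"), "NS": ("g", "uo", "ll", "i"), "Er": ("k", "a", "l", "∅")},
    "bit":    {"Fi": ("p", "a", "l", "a"), "NS": ("b", "uo", "ll", "i"), "Er": ("p", "a", "l", "∅")},
}


def correspondences(comparison):
    """All pairwise segment correspondences shown by one word comparison."""
    found = set()
    for a, b in combinations(LANGS, 2):
        if a in comparison and b in comparison:
            for seg_a, seg_b in zip(comparison[a], comparison[b]):
                found.add((a, seg_a, b, seg_b))
    return found


def is_foundation(comparisons, min_count=2):
    """True if every correspondence recurs in at least min_count comparisons."""
    counts = Counter()
    for comparison in comparisons:
        counts.update(correspondences(comparison))
    return bool(comparisons) and all(n >= min_count for n in counts.values())


def is_core(comparisons):
    """True if the set is a foundation and no proper non-empty subset is one."""
    items = list(comparisons)
    if not is_foundation(items):
        return False
    return not any(
        is_foundation(subset)
        for size in range(1, len(items))
        for subset in combinations(items, size)
    )


four = list(CORE.values())
print(is_foundation(four), is_core(four))
```

The subset search is brute force and only sensible for very small sets; it is there to mirror the definition of a core, not to be efficient.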

However, the data here in effect only allows reconstructing two onsets *k-, *p- and two rimes that I have just called *-esä, *-ală. It does not establish any contrast between the individual segments in the rimes! This means that given just this data, we could also rewrite the rimes in more minimal forms such as *-ele, *-ale, and assume a number of conditional sound changes that apply in all or most descendants (e.g. *l > ⁽*⁾s / e_ in all, *e > a / aC_ in Finnish).

This hence already demonstrates that reconstructions should not be built up from “core reconstructions”: overly limited data leads to overly minimal reconstructions. A four-comparison core is not quite the smallest possible, [5] but obviously most of a realistic proto-language still cannot fit into one. Reconstructions with real phonological labels should probably wait until we have assembled larger phonological foundations — within cores, this work is adequately substituted just by the correspondence patterns themselves.

This phonological core is incidentally also a “semantic core”, with each of the four comparisons showing the exact same meaning in every language. This is probably also a desirable trait in phonological foundations in general, but then not strictly required by the phonological formal side of the Comparative Method.

Comparison Regularity

Using the concepts of phonological foundations and cores, I can now also define a few categories of word comparisons:

  1. A word comparison that belongs to at least one phonological foundation is regular.
    • A core comparison is a word comparison that belongs to at least one phonological core.
    • A regular adduct is a word comparison that belongs to at least one phonological foundation, but does not belong to any phonological core (is not a core comparison).
  2. A word comparison that shows regular sound correspondences as established by a phonological foundation, except for one unique sound correspondence, is near-regular.
    • Given a reconstruction, a single near-regular comparison that does not contradict any soundlaw establishable from the reconstruction without this item is (phonologically) nonprovable; as is the new, once-exemplified soundlaw it requires.
    • Given a reconstruction, a near-regular word comparison that does contradict a soundlaw establishable from the reconstruction without this item (would necessitate setting up a new proto-phoneme) is an exception.
    • If there are two or more nonprovable comparisons, such that they are compatible with the same foundation, but only one of them can be added to the foundation as nonprovable (forcing the others to be exceptions), they are competing.
  3. A word comparison more irregular than a near-regular one (with at least two correspondences that are not regular) is simply irregular. We could distinguish further categories such as “2-irregular”, “3-irregular” etc. (with “1-irregular” being what I have just titled “near-regular”), but in practice the case seems to be that, for any sensible morpheme length, already 2-irregular correspondences are too weak to be very useful for linguistic reconstruction at all.

The first and third points of case #2 may sound confusing; in practice it means simply the case where we do not have enough data to establish what the regular reflex of a proto-phoneme *X in language L might be.

Note that a comparison may be at one stage with respect to one phonological foundation, at another with respect to another.


Continuing the previous example, my above-detailed core motivates hunting for other words displaying the same correspondences: Fi. p- ~ NS b- ~ Er. п-, Fi. -a- ~ NS -uo- ~ Er. -а-, etc. The current core is “closed” in the sense that adducing any one additional and different comparison from among the known (West) Uralic comparative material cannot produce a new foundation. Any new item will have to remain nonprovable until at least one other item has also been added. Purely in theory, it could accept a comparison such as Fi. kala ~ NS gilluo, mixing sound correspondences from different “slots”, which would perhaps prompt reconstructing something like *kålä for ‘fish’, *kälå for this. However, across the Uralic languages it happens to be the case that every complete sound correspondence (a correspondence pattern; see below) is strictly restricted to a particular position in the word. [6]

To pick out one new datapoint: Fi. vala ‘oath’ ~ Er. вал /val/ ‘word’ (with also Sami cognates, not found in NS though) can be seen to be regular save for the initial, i.e. it is nonprovable with respect to this core. Minimally it could be proven to be regular by also identifying a West Uralic **vesä. No such comparison is known though (and indeed no words ˣvesä, ˣveassi, ˣвизэ exist in any meaning at all in the three languages); hence more data still is required. One way to do it would be to adduce also the following two comparisons:

  • Fi. kesi ~ Er. кедь /keď/ ‘skin’
  • Fi. vesi ~ Er. ведь /veď/ ‘water’

These two allow reconstructing a third rime *-eti; vala ~ вал and vesi ~ ведь allow reconstructing a third onset *v-; and the rime *-ală and the onset *k- we knew already. Hence all is again in order. Notice that by now we have not only proven vala ~ вал to be regular: it is indeed as much as a core comparison, since also the set *kală, *vală, *keti, *veti constitutes a core!
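
As a rough sketch of how this grading could be mechanized, the Python below counts, for one Finnish ~ Erzya comparison, how many of its correspondences a given set of comparisons fails to establish. The alignments, the names `FIRST_CORE`, `irregularity` and `grade`, and the two-occurrence threshold are my own assumptions for the illustration; the further distinction between nonprovable items and exceptions needs a reconstruction and is not modelled.

```python
from collections import Counter

# Finnish ~ Erzya only; the slot-by-slot alignment is an assumption.
FIRST_CORE = [
    (("k", "e", "s", "ä"), ("k", "i", "z", "e")),  # kesä ~ kize
    (("p", "e", "s", "ä"), ("p", "i", "z", "e")),  # pesä ~ pize
    (("k", "a", "l", "a"), ("k", "a", "l", "∅")),  # kala ~ kal
    (("p", "a", "l", "a"), ("p", "a", "l", "∅")),  # pala ~ pal
]
VALA = (("v", "a", "l", "a"), ("v", "a", "l", "∅"))  # vala ~ val
KESI = (("k", "e", "s", "i"), ("k", "e", "ď", "∅"))  # kesi ~ keď
VESI = (("v", "e", "s", "i"), ("v", "e", "ď", "∅"))  # vesi ~ veď


def pairs(comparison):
    """Segment correspondences of one two-language comparison, slot by slot."""
    fi, er = comparison
    return list(zip(fi, er))


def established(comparisons, min_count=2):
    """Correspondences that recur in at least min_count comparisons."""
    counts = Counter(p for c in comparisons for p in set(pairs(c)))
    return {p for p, n in counts.items() if n >= min_count}


def irregularity(comparison, comparisons):
    """Number of slots whose correspondence is not established."""
    known = established(comparisons)
    return sum(1 for p in pairs(comparison) if p not in known)


def grade(comparison, comparisons):
    k = irregularity(comparison, comparisons)
    return {0: "regular", 1: "near-regular"}.get(k, "irregular")


print(grade(VALA, FIRST_CORE))                       # against the first core alone
print(grade(VALA, FIRST_CORE + [VALA, KESI, VESI]))  # after adding the new items
```

Against the first core alone only the initial v ~ в is unsupported; once kesi ~ кедь and vesi ~ ведь are added alongside it, every correspondence in vala ~ вал recurs.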

As noted above, this new second core also only works between Finnish and Erzya. In any Sami variety, clear cognates exist only for *vală and *keti. They will remain nonprovable until we have adduced even more evidence to establish the regular Samic development of *-eti rimes, or more generally *-eCi rimes, and of *v-. However, forms like Southern Sami vuelie /vʉelie/ ‘a joik’ or Skolt Sami -kõtt ‘skin’ regardless already suggest what to look for. This evidence is not hard to locate either, e.g. in Fi. veri ~ NS varra (SS vïrre, SkS võrr) ~ Er. верь /veŕ/ ‘blood’. Though this in turn will send us on a lookaround to establish a few other things as regular, most prominently the development of *r in all involved languages and of *t in Sami; secondly also to verify the correspondences v- ~ v- ~ v- and ï-e ~ a-a ~ õ-∅ between our three Sami varieties that have come up so far. All doable too of course. What we can however see already well enough is how extending foundations is not a question of linear progress.

Segment Regularity

A methodological problem that emerges here, once variable amounts of languages per etymology are being compared, is that regularity for every pairwise comparison may be too much to demand. If e.g. between Finnish and Hungarian there is insufficient evidence to establish a correspondence such as s ~ gy as regular at all (the only example of this is ‘urine’: kusi ~ húgy < PU *kuńćə), but at the same time s ~ /ź/ and gy ~ /ź/ can be both established in comparison with Komi (or s ~ /ńś/ and gy ~ /ńś/ in comparison with Mansi, etc.) — is this not good enough? It feels to me that we should not have to choose between either Finnish, Hungarian, or the still quite regular ‘urine’ etymon to include in a good phonological foundation of Uralic.

Lone pairwise comparison is not good enough for everything, on the other hand. This would make it much too easy to set up some “straggling” members in datasets, such that Chuvash maybe has regular correspondences with Mari but not with the rest of Uralic.

Any pairwise segment comparison still only either is or isn’t regular, and I’ve already defined grades of regularity for an individual pairwise word comparison as well. Even further grades of regularity can regardless be defined, first for individual segments as considered across the entire dataset:

  1. Complete regularity: a segment whose every pairwise correspondence across a foundation is regular.
  2. Biconnected regularity: a segment whose graph of pairwise regular correspondences across a foundation is a biconnected graph (cannot be split into two independent graphs by the removal of one “bridging” language from comparison).
  3. Connected regularity: a segment whose graph of pairwise regular correspondences across a foundation is a connected graph.
  4. Soundlawful regularity: a segment whose every correspondence with the proto-language (hence in a reconstruction, not just any foundation) is regular.

Anything less than #4 is obviously no longer regular at all, and instead at most semiregular: since we can attempt to provide a proto-form for every word comparison, gaps cannot be a problem for soundlawfulness.

#4 is, in fact, weaker than even #3. Assume that we had only two examples of the development of *k > /k/ in a poorly known (or just heavily divergent) language such as Muromian. This could arguably suffice to establish the reflex as regular. But then if these two etyma had no overlap in what other languages they have reflexes in, they would not establish any regular correspondence between Muromian /k/ and any other attested Uralic language. I believe it is even possible to create, between no more than three languages, a highly degenerate counterexample dataset that is soundlawfully regular but in which none of the pairwise sound correspondences are.

Another way to create a highly degenerate but soundlawfully regular dataset is to simply pool together two disjoint foundations — say, with data comparing Mari and Chuvash as one component, data comparing French and English as another. This would still suffice to show that *k- > /k/ is a regular sound change in each language (just not that it is the same *k in all cases…). This is clearly absurd as one proto-language though. I suppose global connectedness regardless of regularity should be required anyway.

In large datasets further grades between #1 and #2 could also prove useful. I do not have the intuition to immediately identify them, though (“triconnected” comes to mind as a naive proposal, but whether it would really improve anything much is not clear to me [7]). Before that though, we can consider what are sensible options for datasets with only a small number of languages. For two languages, soundlawful regularity already equals complete regularity. For three, biconnectedness does the same. For four, a double triangle graph is a possible more than biconnected but not fully connected option. But then it’s also already vulnerable to stragglers: e.g. we can find regular correspondences between Swedish, Finnish, Karelian and Erzya, just not between Swedish and Erzya specifically. More Finnic languages could also be added into the mix to create even more highly connected correspondence graphs that still have the same problem. To eliminate this problem, but not the case of *ńć in Finnish vs. Hungarian, perhaps it suffices to demand the existence of some regular correspondences between all pairs of languages (even if not all pairs of segments).
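
The first three grades are plain graph properties, so they are easy to test mechanically. Below is a small Python sketch, with function names of my own choosing, that takes the languages involved and the language pairs between which a segment’s correspondence counts as regular, and reports the strongest grade that holds. Soundlawful regularity (#4) is left out, since it needs a reconstruction and not just a graph.

```python
from itertools import combinations


def is_connected(nodes, edges):
    """Depth-first search over the edges that lie within `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in edges:
            if current in (a, b):
                other = b if current == a else a
                if other in nodes and other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen == nodes


def grade(languages, regular_pairs):
    """Strongest of: complete, biconnected, connected, not connected."""
    nodes = set(languages)
    edges = list(regular_pairs)
    present = {frozenset(e) for e in edges}
    if all(frozenset(p) in present for p in combinations(nodes, 2)):
        return "complete"
    if not is_connected(nodes, edges):
        return "not connected"
    # Biconnected: still connected after removing any single language.
    for removed in nodes:
        if not is_connected(nodes - {removed}, edges):
            return "connected"
    return "biconnected"


# The "double triangle": every pair is regular except Swedish ~ Erzya.
four = ["Swedish", "Finnish", "Karelian", "Erzya"]
double_triangle = [p for p in combinations(four, 2) if set(p) != {"Swedish", "Erzya"}]
print(grade(four, double_triangle))

# Finnish s ~ Hungarian gy, bridged only through Komi.
print(grade(["Finnish", "Hungarian", "Komi"],
            [("Finnish", "Komi"), ("Hungarian", "Komi")]))
```

As far as the graph test goes, the Swedish straggler case comes out as biconnected and the ‘urine’ case as merely connected, which is why some additional condition is needed to tell them apart in the desired direction.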


Did you notice an assumption that I have snuck in unsaid above? It is that pairwise segment correspondences could be linked together into single distinct graphs of a segment’s correspondences. Actually though, this is not trivial, and already constitutes some basic work towards a reconstruction. I can define a few concepts related to this too, while I’m at it. It also gives a first dip into the topic of conditional sound changes and conditional sound correspondences.

  • Given a set of word comparisons covering at least three languages, and with at least some word comparisons not covering all languages, a correspondence pattern is a grouping of pairwise sound correspondences that assigns a reflex for every language and assigns the pairwise sound correspondences of one multi-language word comparison into the same group.
    • A correspondence pattern is fully attested if every language appears at least once within it.
    • A correspondence pattern is complete if every one of its pairwise sound correspondences occurs in at least some word comparison.
    • (I could again define also biconnected, connected etc. correspondence patterns as weaker options, but I am not sure if this is necessary.)
    • A correspondence pattern is well-supported if there exists a word comparison that displays every member of the correspondence pattern. We could also call this “1-supported”, and define “n-supported” as the minimum number of word comparisons that displays every member.
  • A pre-reconstruction is, in turn, a grouping of either binary sound correspondences (if between two languages) or correspondence patterns (if between more than two languages) by positional environments. This could be further split into a few subtypes too, e.g. conditioning by daughter-language phonetics or conditioning by proto-language phonetics. Already a single correspondence pattern, though, could also constitute a (fairly trivial) pre-reconstruction. — It should probably be demanded that a pre-reconstruction only unites correspondence patterns that have some overlap in their reflexes, not arbitrarily different ones.
  • An unlabeled reconstruction is a set of pre-reconstructions that covers every sound correspondence within a comparative corpus. (Two-language comparisons could be always trivially considered to be unlabeled reconstructions.)
    • An unlabeled reconstruction is fully reflected if every pre-reconstruction contains a fully attested correspondence pattern. (In the case of highly split correspondences, we might not want to demand this of every minor correspondence pattern.)
    • Likewise, an unlabeled reconstruction is well-supported if every pre-reconstruction contains a well-supported correspondence pattern; etc.

Note that while an unlabeled reconstruction covers the entire system of correspondences in a given corpus of word comparisons, pre-reconstructions are segmentwise, per one alignment “slot” at a time (and this could maybe use a better term; “unlabeled segment reconstruction” doesn’t strike me as progress though). Also, as we already established in the previous section, correspondence patterns cannot be simply classified as “regular” or “not regular”. They are “once-soundlawful” by definition, but not anything more.


Continuing to work with the example from above, Fi. v ~ NS v and Fi. v ~ Er. в are sound correspondences; Fi. v ~ NS v ~ Er. в is a correspondence pattern that combines them, and moreover suggests also the existence of a third sound correspondence, NS v ~ Er. в. Once we observe that this correspondence pattern occurs exclusively word-initially, it can be combined with also a corresponding word-medial correspondence pattern (Fi. v ~ NS vv ~ Er. в) into a pre-reconstruction: Fi. v ~ NS v-/-vv- ~ Er. в.
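
To make the three properties of a correspondence pattern a bit more concrete, here is a hedged Python sketch for the word-initial pattern Fi. v ~ NS v ~ Er. в. The data are just the initial segments of three etyma mentioned earlier in this post; the dictionary layout and the function names are assumptions of mine, and each etymon simply omits the languages that lack a cognate.

```python
from itertools import combinations

PATTERN = {"Fi": "v", "NS": "v", "Er": "в"}

# Word-initial segment per etymon; languages without a cognate are omitted.
ETYMA = {
    "oath/word": {"Fi": "v", "Er": "в"},             # vala ~ вал
    "water":     {"Fi": "v", "Er": "в"},             # vesi ~ ведь
    "blood":     {"Fi": "v", "NS": "v", "Er": "в"},  # veri ~ varra ~ верь
}


def fits(slot, pattern):
    """True if every reflex in this slot agrees with the pattern."""
    return all(pattern.get(lang) == seg for lang, seg in slot.items())


def is_fully_attested(pattern, slots):
    """Every language of the pattern shows up in some fitting comparison."""
    seen = {lang for s in slots if fits(s, pattern) for lang in s}
    return seen == set(pattern)


def is_complete(pattern, slots):
    """Every pairwise correspondence occurs in some fitting comparison."""
    return all(
        any(fits(s, pattern) and a in s and b in s for s in slots)
        for a, b in combinations(pattern, 2)
    )


def is_well_supported(pattern, slots):
    """Some single comparison displays every member of the pattern."""
    return any(fits(s, pattern) and set(s) == set(pattern) for s in slots)


two = [ETYMA["oath/word"], ETYMA["water"]]
print(is_fully_attested(PATTERN, two))              # NS is never seen
print(is_well_supported(PATTERN, ETYMA.values()))   # 'blood' covers all three
```

With only the two Finnish ~ Erzya etyma, the NS member of the pattern is a pure prediction; adding ‘blood’ attests it directly.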

There would be other options, e.g. to combine the word-initial pattern with a different medial correspondence pattern that it also overlaps with: Fi. v ~ NS ~ Er. в. Note that we usually choose the first option (and say that they reflect Proto-Uralic *w, while the second reflects PU *ŋ) primarily due to the greater phonetic similarity. It would be entirely possible to shift them around in the reconstruction, to claim that PU *w nasalises to *ŋ in Samic (etc.), but that there was also a segment *hʷ that occurs only medially and always lenites to *w or similar. This is in all respects exactly as regular as the usual reconstruction with *w and *ŋ; it only does worse in terms of how natural the required sound changes are.

It’s also possible to advance an objection against the demand for sound correspondences to overlap before they can be combined in the same pre-reconstruction, even if it is clear that often combining non-overlapping sound correspondences would create nonsensical pre-reconstructions. Suppose a proto-language had some prominent allophonic distribution, e.g. between word-initial voiceless *[t] and word-medial voiced *[d]; but, independently in all descendants (e.g. perhaps due to later sound changes such as *st > /t/ or *ð > /d/), /t/ and /d/ have become different phonemes. Then, even if we take phonological and not phonetic data as our input, word-initial t ~ t ~ t and word-medial d ~ d ~ d will end up being two different correspondence patterns with no overlap between them.

Is this a problem? Not necessarily. It seems to me that unifying these as the same proto-phoneme is not a task of reconstruction: it is a task of the phonological analysis of the proto-language. That is, we see that reconstruction outputs “allophonemes”, not phonemes, possibly even in the case where the input is phonemic data. Due to this, it can be often a good idea to also not use phonemic but rather similarly “allophonemic” input data. Suppose now a case where medial voicing of stops has remained purely allophonic in all descendants of a proto-language: in this case, the allophony rule could be still reconstructed, but only if we do not first eliminate it from the data by collapsing [t-] and [-d-] into /t/.

(Despite these examples being simplistic, it is also the case that identifying nonphonological contrasts in a reconstruction is often not trivial. One of the more surprising adjustments to Uralic historical phonology over the last few decades has after all been the result that, while traditionally reconstructed *oo and *ee do seem to contrast with *o and *e, they do not contrast with *a and *ä, and in fact have massive overlap with them in their reflexes, even if the proposed phonological proto-values are quite different. The solution to this has also not been to set up [a ~ oo] and [ä ~ ee] in an original allophonic relationship; it has been to recognize *oo and *ee as later innovations exclusive to Finnic.)

Lastly, a few statistical measures of pre-reconstructions that I can think of, which might come in useful eventually.

  • The multiplicity of the pre-reconstruction is the number of correspondence patterns it encompasses.
  • The split count = S of the pre-reconstruction is the number of phonological splits it tracks. If a language shows N different reflexes across a pre-reconstruction, the split count of this language for this segment is N-1; the total split count is then the sum of these across the dataset.
  • The expected multiplicity is 2^S. The real multiplicity can be often smaller, though, both due to similar conditioning in several languages, and due to gaps in the data where two conditioning factors by accident do not occur in any word (even if this would be theoretically possible). Some general positional considerations could be applied to calculate a better expected value.
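
These three measures are simple enough to compute directly. A minimal Python sketch follows, applied to the pre-reconstruction Fi. v ~ NS v-/-vv- ~ Er. в from above; the representation of a pre-reconstruction as a list of dictionaries and the function names are assumptions for the example.

```python
# A pre-reconstruction as a list of correspondence patterns (language -> reflex).
PRE_RECONSTRUCTION = [
    {"Fi": "v", "NS": "v", "Er": "в"},   # word-initial pattern
    {"Fi": "v", "NS": "vv", "Er": "в"},  # word-medial pattern
]


def multiplicity(patterns):
    """Number of correspondence patterns in the pre-reconstruction."""
    return len(patterns)


def split_count(patterns):
    """Sum over languages of (number of distinct reflexes - 1)."""
    languages = sorted({lang for p in patterns for lang in p})
    return sum(
        len({p[lang] for p in patterns if lang in p}) - 1
        for lang in languages
    )


def expected_multiplicity(patterns):
    """2^S, with S the total split count."""
    return 2 ** split_count(patterns)


print(multiplicity(PRE_RECONSTRUCTION))           # 2
print(split_count(PRE_RECONSTRUCTION))            # 1: only NS splits (v vs. vv)
print(expected_multiplicity(PRE_RECONSTRUCTION))  # 2
```

Here the real and the expected multiplicity coincide, since only one language shows a split and both of its outcomes are attested.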

Higher Regularity

Continuing on. Before we start adding too many near-regularities and irregularities on top of phonological foundations, it is worthwhile to consider how far we might be able to get with just them.

A naive guess could be that the best phonological foundation for some large family like Uralic consists simply of gathering as many phonological cores as possible, taking their union, and topping up with any regular adduct comparisons that fit into this system. I think this is probably a bad idea, though. I’ve bordered above on the problem that it is often possible to identify phonological cores that consist of loanwords. These can be not just loanwords to/from an outside source; they can be also inside a family, creating false correspondences. There might be also some small number of accidental cores out there, even. E.g. nursery words of the mama papa dada type will easily allow establishing a regular correspondence a ~ a between almost any languages in the world, and it would only take a few coincidences to end up being able to show their consonant correspondences to be regular too.

As established, one way to weed these off will be examining the big picture of the sound correspondences and demanding biconnected etc. regularity (essentially an argument from distribution). Another clear source of false positives though is that so far I have not been very strict in defining “regularity” to begin with: I’ve accepted mere recurrence of any kind as sufficient. Normally, two examples of a sound correspondence is actually only very feeble evidence!

My assumptions, previously unspoken, have been the following:

  • If a linguistic relationship is real, then most sound correspondences will recur, over and over, within and between different cores, and build up naturally in this way once we start considering larger foundations.
  • Sound correspondences come in an exponentially decaying longish-tail distribution, and that while some will end up recurring quite abundantly, most don’t.

The second is particularly because of conditional splits, which will divide any proto-segment across multiple correspondence patterns. Between all three of Finnish, Northern Sami and Erzya, there are some 40–50 examples known of the word-initial sound correspondence k ~ g ~ к, some 20–30 for the nextmost abundant examples like word-initial p ~ b ~ п (and it is not coincidental that neither of these consonants has been affected by any further conditional sound changes in any of the three languages); but for the most poorly attested regular correspondences, we indeed have to make do with just two examples between just two languages, before fading into correspondences that are regular only when routed through some additional language, or regular soundlawfully but not by binary comparison, or only semi-regular, or irregular entirely.

Could we just require every pairwise sound correspondence to occur at least thrice, and then work with “3-cores” and “3-foundations” as the most reliable key evidence? This is probably possible between some closely related languages. I am however uncertain if there would exist any of these for wider Uralic at all. There definitely are not any neat and compact nine-item cores that look like *pala *pola *pula | *tala *tola *tula | *kala *kola *kula (analogous in structure to my four-item cores covered above). This is for two reasons: (1) given the “long-tailedness” of pairwise sound correspondences, it is unlikely to find many high-frequency correspondences co-occurring in a word comparison; (2) in Uralic in particular, word roots/stems are relatively long, 4–5 segments, which makes it even harder to find a word comparison that avoids all the rare-but-regular sound correspondences.

Maybe some other condition needs to be relaxed at the same time? E.g. counting things on the pre-reconstruction level instead. After we’ve identified a complementary distribution e.g. among the different Samic cognates of Finnish /v/, we could then recognize Fi. v ~ NS v- and Fi. v ~ NS -vv- as the same meta-correspondence, and so forth… But this actually already pares things back to the level of mere soundlawful regularity: all soundlaws affecting some proto-segment are already encoded within the correspondence bundles of a pre-reconstruction, and only a phonetic label for the proto-segment is missing. And demanding more than two examples of a reflex is not too hard at all.

A better option is perhaps to instead use the fact that words are inherited as a whole. If a word comparison shows three highly recurring correspondence patterns and one more poorly attested but still regular one, the three first should also allow us to put more trust in the fourth not being accidental. We could even calculate the average regularity. To avoid high-frequency correspondences “covering for” too many low-frequency ones, though, this should also probably be the geometric mean, not the usual arithmetic mean.
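
A quick numeric sketch of why the geometric mean is the safer choice. The frequencies below are made up for illustration (three abundant correspondences and one attested only twice), and `wwar` is just a name I am giving the wordwise average here, taking the regularity of a correspondence to be its number of occurrences in the corpus.

```python
from statistics import geometric_mean, mean


def wwar(frequencies):
    """Wordwise average regularity as the geometric mean of the
    frequencies of the correspondences in one word comparison."""
    return geometric_mean(frequencies)


# Hypothetical counts: three well-attested correspondences, one rare one.
counts = [40, 40, 40, 2]

print(round(mean(counts), 1))  # arithmetic mean: 30.5
print(round(wwar(counts), 1))  # geometric mean: about 18.9
```

The single rare correspondence pulls the geometric mean down much further than the arithmetic one, so the three strong correspondences lend it some credit without fully covering for it.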


It’s even possible to propose that wordwise average regularity (let’s abbreviate this to WWAR) should to some extent trump segmentwise regularity altogether. Consider again some case like Fi. kusi and Hu. húgy that is not perfectly provably regular. That we still “want to” relate them can be after all motivated also without reference to the other Finno-Ugric languages, or to any detailed semantic considerations, by how k- ~ h- is a highly regular correspondence. So is -i ~ ∅, though this is a bit too “morphological” to fully count. [8] u ~ ú is also attestable, if rarer [9] and without well-known conditioning factors.

Besides giving a natural way to incorporate nonprovable and exceptional correspondences into an “extended phonological foundation”, WWAR is a measure that has also a few further good features. For one it largely captures the fact that short CV and CVC comparisons are more vulnerable to chance resemblances. For two, inversely it allows putting a bit more trust in word comparisons involving consonant clusters, which often show some highly conditional sound changes ⇒ not highly regular correspondences. In a comparison like Fi. täysi : täyte- ‘full’ ~ Hu. tel- ‘to be full’ we can then rely on as many as four highly or at least reasonably regular correspondences (t ~ t, ä ~ e, s : t ~ l, i : e ~ ∅) and not have to worry about y ~ ∅ too much.

But I think that it is still also necessary to start with solidly regular foundations, since the frequency of a sound correspondence depends on the corpus of word comparisons. Adding kusi ~ húgy to a corpus of Finnish–Hungarian word comparisons, from the sound correspondence point of view, does not only add the one case of s ~ gy, it increases by one also the counts of the other three correspondences, i.e. makes them more regular still. This being the case, there could be a risk of “farming” some highly recurring correspondences from numerous exceptional or nonprovable word comparisons, and using these as the main workhorse carrying the reconstruction. This was one problem in Uralistics in the 19th and early 20th century: a good understanding of some of the stronger correspondences had been worked out (in particular among consonants), which were relied on to accept also all kinds of poorer correspondences (in particular among vowels). Similar examples occur elsewhere in the history of etymology too, I’m sure (insert cliché allegedly-Voltaire quote here).

Given a corpus of word comparisons that is known to include some crud, there should actually exist a sweet spot of a sort. Calculate the WWAR and also the average WWAR across the corpus; then prune the lowest-WWAR comparison(s) and see what happens to the average WWAR. Eliminating highly irregular crud should raise this metric. But also pruning everything down to just a single phonological core would leave the average WWAR at no more than 2. Somewhere, then, there will be a maximum average WWAR that will be in a sense the most regular sub-corpus that can be achieved. There can be multiple local maxima though (there definitely are “between” cores, again as per my above example with vala ~ вал), and I’d have to work through a larger example corpus in detail to see.
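
The pruning procedure could be sketched roughly as follows. Everything here is an assumption for the sake of illustration: the greedy one-at-a-time pruning, the function names, counting a correspondence’s regularity as its raw number of occurrences, and the toy corpus, which is just the seven Finnish ~ Erzya comparisons used earlier with my own alignments.

```python
from collections import Counter
from statistics import geometric_mean, mean

# Finnish ~ Erzya segment pairs per comparison (assumed alignment).
CORPUS = {
    "kesä ~ kize": [("k", "k"), ("e", "i"), ("s", "z"), ("ä", "e")],
    "pesä ~ pize": [("p", "p"), ("e", "i"), ("s", "z"), ("ä", "e")],
    "kala ~ kal":  [("k", "k"), ("a", "a"), ("l", "l"), ("a", "∅")],
    "pala ~ pal":  [("p", "p"), ("a", "a"), ("l", "l"), ("a", "∅")],
    "vala ~ val":  [("v", "v"), ("a", "a"), ("l", "l"), ("a", "∅")],
    "kesi ~ keď":  [("k", "k"), ("e", "e"), ("s", "ď"), ("i", "∅")],
    "vesi ~ veď":  [("v", "v"), ("e", "e"), ("s", "ď"), ("i", "∅")],
}


def wwar_table(corpus):
    """WWAR of each comparison, with frequencies counted within the corpus."""
    counts = Counter(pair for pairs in corpus.values() for pair in pairs)
    return {
        word: geometric_mean([counts[p] for p in pairs])
        for word, pairs in corpus.items()
    }


def average_wwar(corpus):
    return mean(list(wwar_table(corpus).values()))


def prune_trajectory(corpus):
    """Repeatedly drop the lowest-WWAR comparison, recording the average."""
    current = dict(corpus)
    steps = [(sorted(current), average_wwar(current))]
    while len(current) > 1:
        table = wwar_table(current)
        worst = min(table, key=table.get)
        del current[worst]
        steps.append((sorted(current), average_wwar(current)))
    return steps


def best_subcorpus(corpus):
    """The step of the trajectory with the highest average WWAR."""
    return max(prune_trajectory(corpus), key=lambda step: step[1])


for words, average in prune_trajectory(CORPUS):
    print(len(words), round(average, 3))
```

Since this toy corpus contains no crud, pruning is not expected to help here; the point is only the shape of the procedure. Note also that greedy pruning follows a single path through the possible sub-corpora, so it can miss maxima that lie elsewhere.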

Defining WWAR for non-binary comparisons will be something to figure out later also. Would just covering all the pairwise correspondences work? Perhaps it does. E.g. we can note that the number of pairwise correspondences grows quadratically as the number of independent members in an etymological comparison increases, and so this metric would naturally capture the intuitive impression that widespread etymologies are stronger (increase average WWAR more) than narrowly spread ones are.
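As a quick check of the quadratic growth, a comparison with n independent members has n(n − 1)/2 language pairs. The helper below is a hypothetical illustration that simply enumerates them; it assumes every member counts as independent, which closely related languages would not.

```python
from itertools import combinations

def language_pairs(members):
    """All unordered pairs of members within one etymological comparison."""
    return list(combinations(sorted(members), 2))

for n in (2, 3, 5, 10):
    members = [f"L{i}" for i in range(n)]
    print(n, len(language_pairs(members)))
```

A ten-member etymology thus contributes 45 pairwise comparisons against a single one for a binary etymology, which is the sense in which widespread etymologies would weigh more under such a metric.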

Etymological Leftovers

I can think of one further potential problem in approaching reconstruction primarily as collecting phonological cores. A particular etymology could be quite regular between a handful of languages, but not between others. Maybe some further cognate shows unexplained quirks, or in some language group there exists a proposed but very dubious maybe-cognate. This is very common across Uralic, probably in any deeper and wider language family really. How worried should we be if these cognates turn out to not fit into phonological foundations?

For a demonstration of the issue, a few examples from what I generally consider AAA-class Uralic vocabulary overall:

  • *ëla- ‘under’: Mansi shows *jal- instead of expected **ëël-.
  • *elä- ‘to live’: Mordvinic has for this meaning *eŕa-, irregular on every segment but phonetically fairly close regardless.
  • *enä ‘big’: Komi has /una/ ‘many’ instead of expected ˣ/on-/. Udmurt /una/ is in principle regular, but per Komi this may have been irregular *una and not **ona already in Proto-Permic.
  • *ďëmə ‘bird cherry’: Erzya shows /lʲom/ with unexpected /o/ and unexpected initial palatalization, Moksha shows /lajmä/ with unexpected intrusive -j-, and even a common Proto-Mordvinic form does not seem to be readily reconstructible.
  • *ipsə ‘smell’: Hungarian has íz with irregular /z/ (maybe nonprovable as a reflex of *ps in particular). Moksha has /opəś/, irregular on every segment except *p.
  • *jäŋə ‘ice’: Permic has *jë, with unexplained loss of *ŋ (which has parallels though, so technically regular) and an irregular vowel.
  • *jëxə- ‘to drink’: the labial vowel in Samic *jukë-, Finnic *joo- is not really expected and has no exact parallels.
  • *kajwa- ‘to dig’: Samic has *koajvō- instead of expected **kuojvē- or **kuojvō-.
  • *kälä- ‘to wade’: Mansi *kʷääl- has an unexpected labialized initial, Khanty *küüL- unexpected height and labialization of the vowel, instead of expected **kääl- and **kööL- (or **käL-).
  • *kätə ‘hand’: Mari has *kit instead of expected **ket.
  • *kiwə ‘stone’: Udmurt has /kɤ/ instead of expected /ki/ (which does occur in Komi).
  • *kulkə- ‘to go’: Hungarian has halad instead of expected ˣhol- or similar.

(Incidentally it is noteworthy that while there are some consonantal problems too, all of these cases show some vocalic problems.)

Regardless, all of these etymologies show perfectly soundlawful, regular reflexes in at least six other languages. At least the comparison of these is beyond any reasonable doubt. With these exception cases it’s however conceivable that some of them have in fact been adduced erroneously and should be treated as e.g. family-internal loanwords or as unrelated. [10]

There is also a smaller group still of completely and unambiguously clean widespread Uralic etymologies, including e.g. the above-considered *kala ‘fish’ and *pala ‘bit’. Should we perhaps prioritize these cases somehow when building up a phonological foundation? Maybe not. *kala and *pala both happen to lack known reflexes in Permic… If I were to propose some outrageously irregular reflexes from there, does this in any way weaken the other pairwise comparisons? The same really holds for more promising irregular reflexes too. As a reminder, the main point of the framework I am sketching in this post is to assess if a proposed reconstruction or system of correspondences is acceptable, or if it is better than some other proposal. That there remains more work to do is a different issue.

At other times still, a proposed etymology could have deeper fault lines, such as being more regularly considerable as two etymologies, possibly with a bridging member. These are also findable across Uralic, e.g. when western languages point to *kakta but eastern languages to *kettä as the proto-form of the numeral ‘2’. It is not clear to me what to do in such cases. They can still e.g. demonstrate branch-specific sound changes as long as we keep a leash on which pairs of languages are compared.

In any case, much like widespread sound correspondences, widespread etymologies are not all-or-nothing cases. They may be more regular between some languages, less regular between others. Single outliers or multiple equally distant ones will be easy to identify and possibly exclude at least. It is surely a problem if a proposed language family starts having substantial amounts of etymologies which only really work between a few languages and not any others, but this might well be a problem of etymological work and not of the relationship itself. It is hard to think of any formal justification for treating an irregular etymology as “too good to be rejected”. Substantial and intractable irregularity is a good reason to decide that a proposed cognate is just wishful thinking built on superficial similarity, or at the very least too weak to build a foundation on, no matter how long e.g. its pedigree in etymological literature is.

The best illustrations for this principle surely come from cases where a different, more regular etymology turns out to be possible after all. The classic of the genre is the superficial resemblance of Latin deus and Greek θεος. From Uralic, consider e.g. Livonian sūoŗ ‘vein, sinew’: by current thinking this is not a reflex of PU *sënə ‘id.’ (> Proto-Finnic *sooni, reflected in all the rest of Finnic) with irregular *n > *r (> ŗ /rʲ/), it is instead a perfectly regular reflex of a distinct but partly synonymous PU root *särä. As an older example I could mention the comparison of Fi. aivot ‘brains’ with Northern Sami oaivvi ‘head’, which appears in some 19th-century works before being replaced by the current comparisons: Fi. aivot ~ NS vuoigŋašak ‘id.’, Fi. oiva ‘proper’ ~ NS oaivvi (both fully regular even if less immediately apparent). Similar examples could be collected at least by the dozens from etymological literature. [11]


There is one failure mode of overcriticality around this area though; it is one where difficulties in reconstruction are confused with irregularity. E.g. as I’ve pointed out before, in Khanty the development of *kala, *pala differs from a third rhyme word *sala- ‘to steal’. But *ɬaaL- as the reflex of the third is not irregular in any sense I’ve defined so far! There are several cases of *a-a > *aa, hence also several cases of correspondences like Finnic *a ~ Khanty *aa or Samoyedic *å ~ Khanty *aa. The only problem is in our lack of understanding of the conditioning factors that lead to a double representation *uu ~ *aa. It would be possible to e.g. propose an entirely regular reconstruction of PU with two open back vowels, *a and *å, distinguished only in Khanty.

Where from here

Whatever the exact route, it would be a long and at many points tedious exercise to work up from small phonological cores all the way up to our current understanding of Uralic etymology and comparative phonology. This would be regardless illustrative, I think. If we repeated the process with a few other language families too, we might be able to eventually set up an objective metric for how phonologically (ir)regular some known or proposed language relationship really is. Also, just the largest achievable phonological foundation is probably not a good metric. My suspicion is that allowing any and all minimally regular correspondences, without constraints for their number, will lead to a vast ballooning of the system of correspondences that can take just about anything we throw at it (parallel loanwords, parallel derivatives, onomatopoeia…), and something like WWAR will be a much better metric of regularity.

There will be further technical issues to work out too, such as the effects of subgrouping and intermediate reconstructions (which could be used to define something like “phonological subgroupiness” also); the methods we use for identifying conditional sound correspondences; or adding typological constraints for the segment inventory of the proto-language or the sound correspondences we will tolerate (a correspondence like *n ~ *n should surely require less evidence to be acceptable than a correspondence like *m ~ *k).

[1] In language families such as Uralic, which I call “trochaic” or “left-rooted” (I should probably expand on this concept later on as well), alignment is really largely trivial: initial consonants or zero initials always correspond, first-syllable vowels always correspond, medial consonants and respective components of clusters always correspond, stem vowels in languages with bisyllabic roots always correspond. Complications start to arise only in corner cases like metathesis, initial-vowel syncope, or derivational suffixes added to CVC stems. Diphthongs and long vowels could provide some problems too, but then contractions like *ej > ii can be always also rewritten as conditional correspondences along the lines e ~ ii and j ~ ∅.
[2] Then again, what we are currently accomplishing with AI in fields other than linguistics suggests to me that automated linguistic reconstruction cannot be done right on the first try in any case. Any reasonably feasible algorithm most likely has to be based on generating a first pass and then iterating improvements to it. If we are good enough at the latter, it’s OK if the former is still fairly bad. This is how real reconstruction also works, after all.
[3] In particular there are no phonological cores built out of comparisons covering only two languages but with the data altogether covering more than two languages: every two-language pair could be separated as its own core instead.
[4] What’s off is that the different treatment of the final vowels in Erzya is actually not due to any original difference in their strength, it is due to a recent and weirdly specific innovation syncopating *ə after *Cal. Unsyncopated forms have still been attested in Witsen’s 17th century records of Moksha.
[5] The absolute minimum is a comparison of two items with two segments that are the same in both, e.g. al, la in language 1 ~ er, re in language 2, or indeed, a comparison of two pairs of homophones.
[6] All single medial consonants are geminated in most of Sami in strong-grade positions (hence with sound correspondences distinguishable from initial consonants), all final vowels are lenited in languages like Mansi (hence distinguishable from initial-syllable vowels), all original consonant clusters are simplified or broken apart in Hungarian, etc.
[7] “Biconnected” can be taken to mean that between any two vertices, there are at least two mutually disjoint paths, or that the removal of any one vertex will not break the graph into two or more non-connected components. Upgrading these definitions to “three paths” or “two vertices” may not yield quite the same meaning for putative “triconnected” (clearly the former is stronger than the latter though).
[8] After all it has been proposed that in Finnish e-stem words, at least ones like this that have consonant-stem partitive singulars (kusta), only √kus- is really a part of the stem and -i : -e- is a prop vowel. Or even, -e- at least: another possibility, probably more provocative, would be to claim that -i is a nominative singular ending.
[9] Traditionally known proposals include pura ‘drill’ ~ fúr ‘to bore’, suippu ‘point’ ~ csúp ‘point’, survoa ‘to mash’ ~ szúr ‘to pierce’.
[10] E.g. relaxing semantics a bit, Mansi *jal could be compared also with *jalka ‘foot’ or #jülŋä ‘tree stump’ (though these only help with the *j-). For Hungarian íz (dialectally also éz), Finnish & Karelian eto- ‘to find disgusting’ seems like a promising direction of comparison.
[11] And perhaps they should be. Etymologies are most of the time cited in secondary and tertiary literature without the scaffolding of historical phonology that holds them up in the first place. I suspect this often leads to beginners and non-historical linguists getting the false impression (or maybe rather, strengthening the natural folk-etymological impulse of thinking) that just similarity is good enough for setting up an etymology.

Posted in Methodology

Native initial clusters in Udmurt

Typological definitions of Uralic [1] just about always note the lack of native word-initial consonant clusters. While the literary standards have their share of IE-derived clusters by now, in rural dialects and the Siberian languages clusterlessness is common enough to this day. However, exceptions can be found in the other direction too, although they seem to be an understudied topic.

The most obvious offender is Mordvinic, which sports all kinds of words like /kši/ ‘bread’, /kšńi/ ‘iron’. Perhaps in most cases these are IE loanwords in ultimate origin, but involving native syncope, in these examples from preforms along the lines of *kərsä, *kərtnä. Russian influence can be still suspected though, since apparently this syncope is mostly post-Proto-Mordvinic. Two illustrative examples: Moksha /kštralks/ ‘bobbin’ ← /kšťir/ ‘spindle’ + /alks/ ‘bottom’, cognate to Erzya bisyllabic /ščeŕalks/; Erzya /troks/ ‘across’, cognate to Moksha /tərks/, /turks/ (PMo *turəks?). But I do not think the details of the development of these have been worked out in full, and several cases built on native Uralic material can be found also, such as Er. /pŕa/ ~ Mk. /pŕä/ ‘end, head’ (< PU *perä), /pškaďems/ ‘to blow’ (~ Fi. puhkua, Komi /pušky-/ < *puš-kV-). I could also submit some new wilder etymological hypotheses: e.g. could /pra-/ ‘to fall’ be from *pda- < *pədá- < PU *puďa- ‘id.’ ??


The precise history of the Mordvinic initial clusters would really be a fairly large research project. Before diving into it, a decent typological parallel and a much more tractable case study of natively arising clusters in Uralic seems to be provided by Udmurt. In the literary standard, consonant clusters outside of Russian loanwords are rare but still extant. They seem to have a slightly extended presence in the dialects also. I’ve almost never seen this fact explicitly pointed out, however; it came to my full attention only accidentally this March, while reading Michael Geisler’s Vokal-Null-Alternation, Synkope und Akzent in den permischen Sprachen (2005, Veröffentlichungen der Societas Uralo-Altaica 68), which primarily treats V2 syncope.

Interestingly there is a fairly simple phonological rule behind the rise of initial clusters in Udmurt: /ɨ/ is lost in the position CɨCV₂, where V₂ is a full = non-/ɨ/ vowel (though I have no examples with /u/), if the result is a “legitimate” consonant cluster. Nearly all examples I’ve found adhere to this (see below for one clear + one possible exception), and I’ve also found no widely distributed counterexamples in underived roots. In derivatives from or inflected forms of CɨC or CɨCɨ roots, syncope could be expected to be mostly reverted / prevented by analogy of course.

There is more uncertainty in the details of what counts as a “legitimate” consonant cluster, as well as in how widely this rule is reflected in the Udmurt varieties (it is almost surely post-Proto-Udmurt). The data below is mainly from the intersection of Wichmann’s Wotjakischer Wortschatz and Csúcs’ Die Rekonstruktion der permischen Grundsprache, the latter taken into account to ensure I am indeed dealing with inherited Permic material and not recent loanwords / coinages entirely. Dialect abbreviations are G(lažov) (northern), S(arapul) and M(almyž) (central), J(elabuga) (southern), Uržym (MU) and U(fa) (southeastern).

The best-established cluster type is stop + /r/:

  • /dɨr/ ‘probably’ → MU /dɨrak/ ~ /drak/ ‘id.’
  • /kɨrɨ-/ ‘to dig’ → G /krem/ ‘dike’
  • /kɨre(d)ź/ (most dialects) ~ literary & U /kreź/ ‘traditional box zither instrument’
    (similar to the Russian gusli, Finnish kantele, etc.)
  • /pɨr/ ‘always’ → /prak/ (several dialects) ‘id.; straight’
  • /pɨrɨ/ ‘piece’ ~ U /pri/ ‘id.’
  • /tɨr/ ‘full’ → /tɨros/ ~ /tros/ (several dialects) ‘id.; many’

In the Wichmann+Csúcs data, there is also one example each of /pl-/ and /sl-/:

  • /pɨlaśkɨ-/ ‘to bathe’ ~ G MU /plaśkɨ-/
  • /sɨlal/ ‘salt’ ~ G /slal/

So no big surprises so far, just rising-sonority clusters of a globally common type.
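The syncope rule as applied to the stop + /r/, /pl-/ and /sl-/ examples above can be sketched as a small function. This is only an illustration under explicit assumptions: the set of “legitimate” clusters is limited to exactly the types attested in the lists above, forms are treated as plain strings with single-character initial consonants, and the names `LEGITIMATE` and `syncopate` are made up for the example.

```python
import unicodedata

VOWELS = set("aeiouɨɤ")

# Assumption: only the cluster types illustrated above are allowed here.
LEGITIMATE = {stop + "r" for stop in "pbtdkg"} | {"pl", "sl"}

def syncopate(form):
    """Drop first-syllable /ɨ/ in CɨCV₂ if V₂ is a full vowel and CC is allowed."""
    form = unicodedata.normalize("NFC", form)
    if len(form) < 4:
        return form  # too short to have the shape CɨCV₂
    c1, v1, c2, v2 = form[:4]
    full_v2 = v2 in VOWELS and v2 != "ɨ"
    if v1 == "ɨ" and full_v2 and c1 + c2 in LEGITIMATE:
        return c1 + form[2:]
    return form

for form in ["tɨros", "dɨrak", "pɨlaśkɨ", "sɨlal", "kɨrɨ", "tɨr"]:
    print(form, "->", syncopate(form))
```

Forms such as /kɨrɨ-/ and /tɨr/ come out unchanged, since they lack a full vowel in the second syllable; the sketch says nothing about dialect distribution or about analogical restoration in derivatives.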

A very different case can be found in ‘rye’: /dźeg/ in literary Udmurt and almost all dialects, but with a bisyllabic byform /dźɨźeg/ ~ /dźiźeg/ in Uržym, which is clearly more original in light of the Komi cognate /rudźɤg/. [2] Also interesting is the Ufa form /źeg/, since in this variety word-initial *dź- normally gives a nonsibilant affricate [ɟʝ-] (= ďj in Wichmann’s transcription). I would hypothesize that this is not a case of *dź+ź losing the second member, but rather *dź+dź losing the first member, already before the lenition *-dź- > /-ź-/ that is found in most dialects of Udmurt; then this new *dź deaffricates in Ufa even initially, which nicely parallels it having also *dž- > /ž-/. [3] Of course most of this might be also simply some sort of haplology, rather than ever going through an actual cluster *dź(d)ź- at all.

Another haplologyish case is ‘eight’. Per /kɨk/ ‘two’, and Komi /kɤkjamɨs/, the pre-syncope Proto-Udmurt form of this must be *kɨkjamɨs (as reconstructed also by Wichmann). Only syncopated forms have been recorded though: /ťamɨs/ in most dialects, a byform with /ťj-/ in Glažov as the only hint that something is up. [4]

Another group still is built up by clusters of the type sibilant + stop/nasal. These demonstrate some “regression to the mean” — they tend to “de-cluster” again across the Udmurt dialects, but this time by epenthesis of an initial vowel: /i/, partly also /ɨ/. This of course leaves the cluster as such intact, but does break it into two different syllables. The Glažov variety appears to fairly consistently retain the elsewhere syncopated original vowel however, though possibly colored to /i/ by palatals. As for where an epenthesized form occurs or not, I see no pattern. Double representation is common, and probably both variants exist widely side by side, and the literary standard and in some cases Wichmann have randomly ended up sampling only one or the other.

  • G /sɨkal/ ~ literary, G J MU U /skal/ ~ S M /iskal/ ~ J MU U /ɨskal/ ‘cow’
  • G /sɨpaj/ ~ U /spaj/ ~ MU /ispaj/ ‘beautiful, good’
  • G /šɨnɨr/, /-ń-/ ~ literary, S M J /inšɨr/, MU /ińšɨr/, U /imšɨr/ ‘threshing ground’
    ~ Komi /rɨnɨš/ < *rɨŋɨš < *riŋəšə > Finnic *riihi
  • G /śińer/ ~ M /er/, MU /śńer/, U /šńer/ ~ literary, S J /iśńer/, M /iśńɤr/ ~ U /ɨšńer/ ‘broom’
    ~ Komi /jiś/; compound with /ńɤr/ ‘twig, rod’
  • G /śike/ ~ MU /śke/, /ske/ ~ literary, M J /iśke/ ‘so, thus’
    ~ Komi /eśkɤ/ ‘conditional mood particle’

Syncope-then-epenthesis would not be the only possible history for these, but this has support from the fact that both /ɨ/-syncope and pre-sibilant epenthesis can be independently attested, the latter in Russian loanwords such as U /smolla-/ ~ M /ismola-/ ‘to tar’ ← смола ‘tar’, G /šľapa/ ~ MU /iślapa/ ← шляпа ‘hat’, G /štop/ ~ J /ɨštop/ ‘jug’ ← штоф ‘a measure’, J /iźver/ ‘predator’ ← зверь ‘beast’. Note also the lack of epenthesis to **šiľapa, **šɨtop in G.

I assume ‘threshing ground’ has been further metathesized from expected *išnɨr by folk-etymological influence of /in/ (~ J MU /iń/) ‘place’. Why the Ufa form has /m/ I have no idea; that the word had proto-Permic and thus most likely also Proto-Udmurt *-ŋ- does not clarify anything. In ‘broom’ Komi seems to suggest original *(j)iś-, but as this has no further etymology, maybe this is rather a loan from Udmurt with the 2nd part dropped. Also, perhaps the first part of the compound is /śi/ ‘hair, bristle, fibre’ (also occurring with /ɨ/ in MU /ďɨrśɨ/ ‘head hair’)? A broom is indeed a ‘rod with bristles’.

In ‘so, thus’ the initial vowel is clearly original however, as this comes from Volga Bulghar *ećke > *ićke (> Chuvash /əśke/), and hence requires etymological nativization in G.

The case of ‘cow’ then seems to have relevance beyond Permic even. This has known cognates a bit more widely, but these are also syncopated and partly even epenthesized! /skal/ in Mordvinic, /škal/ ~ /əškal/, /ŭ-/, /u-/ in Mari. UEW treats these as coming from a common protoform *uskalɜ with somewhat arbitrary loss of the initial vowel. However, if I am correct about the Glažov forms in /SIC-/ being mainly archaisms, then this is probably not correct: at least the Mari forms should be considered one or more loans from Udmurt specifically. The Mordvinic form could still have come about by parallel syncope. As commented by Bereczki (1992), retained /a/ in the original 2nd syllable most likely regardless indicates an areal loanword of some unknown origin. But now we seem to know that the shape of this source has probably been more like #sukal or #sikal than #skal or #uskal.

A sixth possible member in this group could be /ɨštɨr/ ~ M /ištɨr/ ‘footrag’ from a *štɨr < *šɨtɨr, as I suspect on grounds of the unmotivated /i/ in the Malmyž variant. This too has a Mari equivalent /štər/ ~ /əštər/ that would again have to be a loan from Udmurt. Furthermore these have been compared by Wichmann [5] even with Finnish (+ Karelian, Ludian, Veps) hattara ‘footrag’ < ? *šattara. The vowel correspondence a ~ /ɨ/ is rare and irregular though, so probably this is in any case not all the way from Proto-Uralic. The Finnic lexeme has also an alternate etymology as a semantic specialization of the homonymous hattara ‘fluff’.


‘Threshing ground’ and my proposal for ‘footrag’ diverge from the other examples by showing syncope from *CɨCɨC. Even these could be seen as kind of regular, once we consider the mechanism more carefully. The basic conditioning mechanism is surely not vowel quality per se, but rather stress. A typical feature across the more central Uralic languages (and also Turkic!) is a pattern where stress is still technically initial by default, but is widely retracted onto “stronger” vowels (long, full, open) later on in the word. In other words: syncope targets specifically pretonic /ɨ/. This would suggest that the immediate predecessor of hypothetical *šnɨr and *štɨr was more specifically iambic *šɨˈŋɨr, [6] *šɨˈtɨr, in contrast to trochaic stress on the more typical *CɨCɨ roots. If so though, it is too early for me to take a guess on what would have been the reason for such unexpected stress placement.


If there has been fairly regular loss of pretonic /ɨ/ in Udmurt, a natural follow-up question is: what about examples where this doesn’t lead to initial consonant clusters?

Two subtypes can be considered. The first would be aphaeresis: we would expect words of the shape *ɨˈCV(C) (where V ≠ ɨ) to again lose the first syllable and to give plain monosyllabic /CV(C)/. There perhaps are some of these out there too, since per Csúcs’ Proto-Permic vocabulary, it seems that no examples of this root shape have survived intact in Udmurt. The only examples of basic word roots with surviving word-initial /ɨ/ are all either monosyllables (/ɨľ/ ‘moist’, /ɨń/ ‘flame’, /ɨm/ ‘mouth’…), have /ɨ/ also in the 2nd syllable (/ɨbɨ-/ ‘to shoot, throw’, /ɨšɨ-/ ‘to be lost’…), or have an intervening consonant cluster that would not work as a legitimate word-initial cluster (/ɨrgon/ ‘copper’). /ɨ/ remains also in the compound /ɨbes/ ‘gate’ (from /ɨb/ ‘field’ [7] + /ɤs/ ‘door’) and a few inflected forms like /ɨč-e/ ‘such’. However, I also do not find any clear enough candidates where a Komi word of the shape /ɨCV/, /uCV/, /ɤCV/, /iĆV/ would lack an Udmurt cognate. The closest are Old Komi /idɤg/ ‘angel’ (whose hypothetical Udmurt cognate would be expected to be *ideg and not **ɨdeg > **deg), and Komi /ɨrɤš/ ‘ale’, a derivative from a lost verb *ɨr- (so perhaps derived only within Komi). The root shape *ɨCV(C) indeed seems to be lacking in Proto-Permic entirely. Would it be too bold to hypothesize that these have actually lost their initial vowel in both Komi and Udmurt?

One speculative etymology of this type could be ‘udder’. The more southwestern Uralic groups all use some sort of a loanword from Indo-European (F. *udar, Mo. *odar, Mari *wåðar). The Permic languages however have an unetymologized /vera/ ~ /vɤra/. If this came from earlier *ɨvɤra, perhaps it could be a part of the same group after all? But the final /a/ looks worrisome (‘udder’ seems to have been a consonant stem all the way from PIE to attested Indo-Iranian reflexes), as does wringing /v/ out of *-ð- < *-d- < *-t-, which normally lenites all the way to zero in Permic.

— The second possibility of difficult-to-detect *ɨ-loss is syncope before a zero medial. Udmurt has occasional bisyllabic vowel clusters of a relatively wide variety, e.g. /ju.a-/ ‘to ask’, /ju.ɨ-/ ~ /jʉ.ɨ-/ ‘to drink’, /ju.o/ ‘I will drink’, /ki.on/ (~ /kijon/) ‘wolf’, /lu.o/ ‘sand’, /na.a-/ ‘to look at’, /śi.ɨ-/ ‘to eat’, /vu.ɨ-/ ~ /vʉ.ɨ-/ ‘to come, come to completion’, /vu.em/ (~ /vujem/) ‘row’ (< *’order’ < *’completion’). There however again seem to be no examples of the shape Cɨ.V — or at least: none surviving as such.

I can also propose at least one actual candidate for this type of syncope, with a bit more confidence than the previous example even. In nouns we would expect this to yield a simple monosyllabic /CV(C)/ root. There are, however, no fully monosyllabic verbs in Udmurt! All have a stem vowel, in the citation form = the infinitive (ending in /-nɨ/) either /ɨ/ or /a/. This marks what at least some (maybe most?) grammars call the inflection class of the verb. [8] So what would happen if a verb of the shape *Cɨ.a- were to be syncopated to *Ca-? It seems to me that a likely outcome would be to pleonastically apply a second instance of class-marking /a/. This gives, I think, at least a good hypothesis for what’s up with the rather strange-looking /na.a-/ ‘to look at’ (attested only in the Besermyan dialect). This has usually been considered a reflex of PU *näkə- ‘to see’, but we would expect the reflex of this root in Permic to be rather *ni- (cf. /ki/ < *kätə ‘hand’) or perhaps *nɨ- (cf. /tɨ/ < *täwə ‘lung’). And my thinking is that this may have been indeed the case still in Proto-Permic, if this Udmurt form comes from earlier *na- < *nɨ.a-.


The attrition of initial consonant clusters from a language’s phonology can be observed in dozens of languages (Sinitic, Tibetic, Indic, Iranian, Armenian, Albanian…). Their introduction natively seems much rarer though; yet this process should be equally important for understanding large-scale shifts in typology. Other examples I know of are however mostly limited to a few way-out-there cases that clearly must have deleted vowels quite rampantly, e.g. Itelmen (Western /kɬfənʲck/ ‘in front’), the Okinawan languages (Amami /ʔkwa/ ‘child’, Ōgami /pstu/ ‘person’), or all the “sesquisyllabic” languages of SE Asia. Udmurt is in contrast a quite pleasantly tractable case, where only modest clusters have arisen in minor amounts. Yet, as the *ST- *SN- section shows, even these can throw up further complications. Some further cross-linguistic comparison with other cases would be interesting… if they first can be found somewhere. I suppose I already have Mordvinic lined up next. Another case I’ve seen reported is Central Dravidian, where the main rule seems to be a Slavic-esque liquid metathesis. But that’s about it for leads I have within the major Eurasian language families that I have the most knowledge of. Probably I would have to look into e.g. minor Niger-Congo or Austronesian languages (subfamilies even?) to find further cases where it is relatively sure that a language has definitely evolved from allowing only simple onsets to allowing initial consonant clusters.

[1] Not that there are any exclusive and pan-Uralic typological features; any kind of a “Uralic typological profile” immediately bleeds further east towards “Ural-Altaic” and/or “Uralo-Siberian”.
[2] This is surely in turn in some fashion from Germanic–Balto-Slavic (and → Finnic) *ruǵʰis, seemingly with either metathesis (PP *rudźeg < *rugedź, somehow via Finnic?) or a new velar suffix (PP *rudź-eg ← *rudź via BSl.?; loss of *-s is expected, cf. *pårś ‘pig’ from IE *porḱos).
[3] The Ufa variety clearly must’ve already had its /ďj-/ at this point too. I even wonder if this could be a hint that this is a retention, that Proto-Udmurt *dź- (< PP *dź- and also *r- / _VĆ, _V{s z}) was actually rather nonsibilant *ďj- or even a stop *ď-. Which could then have interesting further implications too, e.g. should this be perhaps applied even to Proto-Permic? Already for Proto-Udmurt and Proto-Komi, very few instances of *ď can be reconstructed, even fewer still for Proto-Permic, and only word-medially it seems.
[4] What’s also curious is that most varieties do not have a general shift *kj > /ť/, and palatalization seems to have taken place in this case only due to the cluster *kj having been forced to occur within one syllable. In some cases apparent palatalization can be found also medially, e.g. ‘to laugh at’: standard and most varieties /śerekja-/, but Uržym /śereťa-/ ~ /-kť-/ ~ /-ḱ-/. But then this variety also has general *j- > /ď-/ word-initially, and that’s probably also how the form with /-ť-/ comes about in the first place, not as *kj > ḱ > ť. (The /-kť-/ variant also suggests the same.) This pathway looks even clearer when compared with Ufa /śeregďa-/, where presumable intermediate *-kď- has assimilated in voicing regressively, not progressively.
[5] Wichmann, Yrjö. “Etymologisches aus den permischen Sprachen”. — Finnisch-Ugrische Forschungen 12: 128–138.
[6] While *ŋ > *n is regular in a few Udmurt dialects, including in Glažov when adjacent to /ɨ/ (*šɨŋɨr > /šɨnɨr/), it is not widespread enough to seem to me like the main explanation for why no /ŋ/ remains anywhere else either. What strikes me as more likely is that syncopated **šŋɨr was immediately adjusted to *šnɨr due to Udmurt not tolerating word-initial /ŋ/.
[7] A cranberry morpheme in attested Udmurt, but an independent lexeme in Komi, and per also a second derivative /ubo/ ~ /ɨbo/ ‘beet’ in Udmurt, this probably still existed in Proto-Udmurt too, perhaps up to the time of the syncope rule.
[8] Unlike e.g. Hungarian, or typical older Indo-European languages, this contrast does not affect the choice of endings as such, only the stem morphotax. At a pinch, consonant-initial suffixes are added to a vowel stem, either /CVCɨ-/ or /CVCa-/; vowel-initial suffixes to a consonant stem, either /CVC-/ or /CVCal-/. It would be possible to treat the contrast also as one between underlying consonant stems and underlying vowel stems (/CVC-/ versus /CVCa-/), with /ɨ/ and /l/ inserted as prop vowel and prop consonant when required, though these are not anything like general morphophonological rules in Udmurt (the default prop consonant is /j/). — /ɨ/ is also syncopated in some positions in some dialects, roughly according to what medial consonant clusters Udmurt tolerates in general. This creates a more “natural” look for verb inflection (and in these dialects we definitely should speak of consonant-stem verbs). As a tangent: contrary to what most reconstructions claim, I however do think that this is indeed syncope, and more consistent vowel-stem inflection is the Proto-Udmurt and probably also Proto-Permic state of affairs. If dialect forms like /karnɨ/ ‘to do’, /punnɨ/ ‘to plait’ (even if nicely paralleled by Komi /karnɨ/, /pɨnnɨ/) were to be soundlawful predecessors of more widespread forms like /karɨnɨ/, /punɨnɨ/… why do words like /kɨrnɨ(d)ž/ ‘raven’, /tunne/ ‘today’ then not also turn into ˣ/kɨrɨnɨ(d)ž/, ˣ/tunɨne/? At most vowel insertion could be analogical, and this then fails to explain why the distribution of /CVC-/ versus /CVCɨ-/ in the “consonant-stem dialects” is quite consistently phonologically conditioned.

Posted in Reconstruction

Etymology squib: *äńćä ‘(rasp)berry’

A recurring complaint I have about the more impressionistic reconstructions found in the UEW is the frequent use of *ŋ as a kind of deus ex machina phoneme, reconstructed for all sorts of confusing correspondences of nasal consonants. One offender is the word for ‘raspberry’, given as *äŋɜ-ćɜ. No reflexes are known from Samic, Finnic or Samoyedic, which often spells trouble for working out the overall root shape; and a lack of reflexes in Finnic or Hungarian in turn spells trouble for coverage in the etymological literature. All the remaining Uralic branches still have reflexes though, ranging from Moksha to Southern Khanty, so there is probably still something native Uralic in here, not just late areal loanwords (in Mansi this is regardless a clear Komi loanword).

A casual look over the reflexes indeed reveals a clear /ŋ/ in Mari (Meadow /eŋəž/, Hill /əŋgəž/); and the Khanty reflex *-ääńć only appears as a latter member in compounds, and hence could be expected to have gone thru a bit more reduction than usual anyway. But this is about as far as we get before things stop working.

In Permic we find /m/: Udmurt /emedź/ ~ /emeź/, Komi /ɤmɨdź/ ~ /ɤmidź/ ~ /ɤmedź/ (probably < PP *ɛmedź). This is not an entirely unprecedented reflex of *ŋ, but usually development to *m only takes place adjacent to labial vowels, and even then should be usually retained in various Udmurt dialects. Compare e.g. PP *pɔŋ ‘head’, whence northern and central Udmurt /pum/, southern Ud. /puŋ/, most Komi /pon/, northernmost Komi /pom/.

Within Mordvinic, Erzya /ińźej/ ~ /ińźeŋ/ (also /ińdźej/ in Paasonen’s dialect data) could seem to suggest syllable contraction and POA assimilation similar to the implicit Khanty development: *-ŋVć- > *-ŋVź- > *-ŋź- > /-ńź-/? However, this fails in light of Moksha /ińəźi/, /ińiźi/, which has evidently escaped syncope, showing that the word was still trisyllabic *iNəźəŋ in Proto-Mordvinic. But medial *-ŋ- should then have definitely yielded **-j-! A development *-ŋ- > *-ń- that is probably being supposed here is otherwise entirely unknown in Mordvinic. The same would go for any kind of a suggestion of secondary epenthesis *-ńź- > /-ńiź-/.

Even in Mari there is the further problem that Hill Mari /ə/ does not regularly reflect PU *ä. This, however, could prove to be the key to the problem. We do find at least one other parallel for the correspondence MMa /e/ ~ HMa /ə/ word-initially, which I think is not accidental: the verb /eŋa-/ ~ /əŋgä-/ ‘to burn’ (from PU *äŋə-, demonstrating also that in Komi the expected reflex of *ŋ in a front-vocalic environment is /ń/). Raspberries happen to have a natural connection to fire: in the taiga zone, they are a typical pioneer species thriving in forest areas cleared by wildfire, sometimes in quite good abundance until crowded out by a shading tree cover. So I will suggest that the Mari name of the raspberry is in fact directly based on the verb stem ‘to burn’, and the “suffix” /-əž/ is what actually brings in the meaning ‘berry’. In light of Khanty, I will further suggest this is indeed an old compound, with the second member continuing a simpler root *äńćä ‘(rasp)berry’.

The other reflexes can be probably treated as compounds as well. In Mordvinic the first member could be perhaps identified with the first syllable of *ińďəŕ ~ *ińďəŋ ‘honeysuckle’ (which grows red compound berries similar to the raspberry) or *ińə ‘big’ (the raspberry grows fairly large berries even in the wild, unlike other culinarily important species such as the strawberry or blueberry). For Permic *ɛm- I do not have any etymology; however, a compound analysis seems to gain some other support from the fact that there is no real evidence for a “suffix” *-edź in Proto-Permic. And then among the very few words showing this ending, another berry can be found too: *pɛledź ‘rowanberry’. Usually this has been taken as a somehow heavily divergent reflex of PU *pićla ~ *pićrä ‘rowan’, and there could be some of this source still involved after all. But PP *pɛledź would also come quite close to a compound of *pȯl ‘time, instance’ + my already hypothesized *-edź ‘berry’. Hence: ‘many instances of berry = clustered berry’, as at least a folk-etymology? *ɛ ~ *ȯ are not quite identical, definitely not if read as IPA [ɛ] and [ɵ]; etymologically they are however both primarily reflexes of PU *ä, coincide in having /ɤ/ as their main reflex in Komi, and Zhivlov (2010) has even already shown that the two are in a complementary distribution in a fairly large number of environments; though the topic will probably still need some work.


A few other berry words have been reconstructed for PU as well: the clearest is *mura ‘cloudberry’, and also *pola or *pala ‘? lingonberry’ is quite probable. Finding any really new etymologies in this semantic area would likely however require first looking at each language group’s berry terminology as a whole… One precedent for this exists, a case study of Finnic berry names by Eino Koponen: “Itämerensuomalaisen marjannimistön kehityksen päälinjoja ja kantasuomen historiallista dialektologiaa”, 1991 (SUSA 83: 123–161); and what this reveals is absolutely rampant analogy, e.g. that Finnish puolukka ‘lingonberry’ has likely been rebuilt almost entirely after/in tandem with juolukka ‘bog bilberry, Vaccinium uliginosum’, or Fi. vadelma ‘raspberry’ evolving in some dialects, after some other steps, into vaarain under the influence of muurain ‘cloudberry’. This possibility will probably need to be taken into account elsewhere too.

Posted in Etymology

What’s important for what in historical Uralistics

A question from an email discussion, the answers to which I think would be interesting to others as well:

Are certain branches more valuable than others when it comes to their relevance to Uralic historical linguistics?

I cannot offer any kind of rigorous rankings; only my own impressions, and they will not be the most detailed or best-researched. But they will be something, I hope.

1. Vowel phonology
To this day based primarily on Finnic, Samic and northern Samoyedic — which are also the languages that best preserve unstressed vowels and bisyllabic root structure, giving an additional good reason to think they might be more archaic than the others. Mordvinic fits well in with F&S, Mari is messier, Permic and Khanty are huge messes (but still far from terra incognita). Hungarian and Mansi behave reasonably well again IMO, but this is really a bit hard to see from published research when almost everyone insists on comparing them with Khanty in the first place and not with the rest of Uralic. Southern Samoyedic is quite simply understudied, usually treated as just an appendix to Northern.

2. Consonant phonology
This has a relatively even basis, the rough details for every language have been known already since the late 1910s. I guess Permic has retained the most phoneme-wise distinctions, and Finnic + Samic followed by Samoyedic and Mordvinic are the most important for reconstructing consonant clusters, but every language matters for something.

3. Inflectional morphology
Also a relatively even playing field with the rough details well-known. Samic and Samoyedic could be said to stand out somewhat for having fairly archaic possessive suffix and case systems. Hungarian has almost completely upturned its noun inflection and Permic is not too far behind, but even in these the verbs retain a typical enough Uralic shape.

4. Derivational morphology
There have been some general overviews in the past, but this is a topic that needs more work all around. Word derivation has been described well only for Finnic and Hungarian (individually; not for the two of them in comparison, and not even for Proto-Finnic). Tundra Nenets clearly has the third-best coverage thanks to Tapani Salminen’s A Morphological Dictionary of Tundra Nenets, but then to my knowledge this has not been worked into any kind of a historical framework so far. I have the impression there’s some good literature in Russian at least on Erzya and Komi? No idea how much of a historical angle they would have.

5. Lexicon
No contest here. Finnish is surely the lexically best documented language in the world, with the Dictionary of Finnish Dialects archives covering 8.5 million records across perhaps 350 000 lemmas (contrast with “only” 300 000 lemmas in the Oxford English Dictionary, or 570 000 in English Wiktionary, despite orders of magnitude more speakers). Lexical documentation on Finnic more widely is mostly in pretty strong shape too, and so is etymology within the family. (For decades now most progress has come from loanword research though.)

Samic and Khanty are the next-most important. Huge dialect dictionaries have been available for a long time, and there is also a lexical reconstruction of Proto-Samic as well as an etymological dictionary of Khanty. Mansi could maybe eventually join this club if someone were to assemble a single comparative resource from all the individual ones. Komi and Udmurt are also similar but less diverse. Mari and Hungarian have been well documented, the latter also researched, but they are frankly not very rewarding due to lots of loanwords. Mordvinic falls between these and Permic. Samoyedic is documented quite heterogeneously, and all reasonably large dictionaries other than for Tundra Nenets are very recent; there is surely going to be a lot of fodder for Uralic etymological research there still.

6. Syntax
There are some good areal overviews, but then even the theory of syntactic reconstruction is not very advanced. (Towards this goal, I believe that a lot of internal reconstruction work has been accomplished that is kind of “hiding” in generative syntax, and is just waiting to be rewritten in actual historical terms… but this is generally almost all on languages other than Uralic.)

7. Anything else?
There would be some other but less strictly linguistic approaches to “historical Uralistics”, e.g. poetry, mythology, other ethnography; genetics and archeology even. This all falls outside my expertise however. The most I can say is that all this was a big part of the Uralic / Finno-Ugric studies paradigm a hundred or so years ago. Today much less so, now that we know that language, culture and genes are not quite as tightly bound together as romantic nationalists once assumed. Of course I still warmly support following such neighboring disciplines and continuing to integrate their results with linguistics where applicable.

Altogether then we end up with the three most diverse branches Finnic, Samic and Samoyedic (esp. Nenets) represented the best in research; Hungarian and Mari represented noticeably poorly, unless I have missed entirely some kind of major sources.

Posted in Commentary

Etymology squib: *puj- ‘back end, point’

In the UEW we find a rough Proto-Ugric reconstruction *pukkɜ ‘blunt end of a tool’, with divergent later semantic development: ‘eye of needle’ in Ob-Ugric, ‘back of hammer/ax/knife/…’ in Hungarian fok. There is reason to suspect though that if related, these words do not go back to a simple bisyllabic root. Mansi *pup could be maybe in principle derived from *puK-p. However, *ɣ in Khanty *poɣ does not correspond to Hungarian k! The normal Khanty reflex of *kk is unlenited *k. [1] This discrepancy clearly shows UEW’s reconstruction to be overly impressionistic. Still, the comparison as such does not have to be abandoned: it can be instead approached as a family of three different derivatives, *puCV-ka, *puCV-pV, *puCV-kka with some lost weak medial consonant.

The identity of this lost consonant has been discovered by now, too. While rooting around for references on Samoyedic etymology, I have found that Helimski in an apparently little-known 2001 paper, “PU *i̮ś- ‘to cause to be, to be’ and some other core vocabulary items in Proto-Uralic”, [2] in passing connects the Ugric words with a newly set-up Samoyedic *puj ‘blunt end of a tool (eye of needle, back end of sled runner)’. Loss of *-jə- in derivatives is regular in Ob-Ugric; in Hungarian the conditioning might be rather *uj > *u or *jkk > *kk > *k. The meaning ‘eye of needle’ as a Siberian Uralic semantic isogloss is interesting; maybe it is not a common innovation, but rather an archaism that has not survived in Hungarian.


Helimski however does not seem to have noticed that this new reconstruction allows a few further etymological connections too. At least Mansi *puj, Khanty *puuj ‘back part’ is obviously connectable as continuing the underived basic root of all the ‘blunt end’ reflexes. (The Khanty vocalism is, once again, difficult to explain though; maybe it’s a loan from Mansi.) UEW (s.v. *pujɜ) also gives several other generic spatial reflexes from Samoyedic that go back to *puə. Northern Finnic *poo ‘butt’ is often also considered a reflex, but this runs into phonological problems. My expectation would be for *pujə to yield instead *pui or at most *puu.

Given the new evidence of reflexes referring to tools, I would suggest that better Finnic cognates can be found, showing phonological development as expected. For one, there’s the word family including Finnish puikko ‘(narrow) stick, rod’, puikkari ‘net needle’, puikkaa- ‘to stick (in)’, earlier probably something like *’to poke with a blunt tool’. These can be taken as derivatives from a (pre-)PF stem *puikka. It would seem to be an exact equivalent of Hu. fok < *pujə-kka, but different meanings suggest they more likely have been formed independently. A different direction of derivation appears in the adjectives puikea ~ pujea ‘oblong = having a defined end’ < *puj-(k)əta. Furthermore pujo ‘narrow; narrow object’ could belong here (probably as a late derivative within Finnish, something like ancient *pujə-w I’d expect to give **puju). — SSA mentions a different etymology for the *puikka family: derivation from puu ‘tree, wood’ (or rather, from the plural stem pui-), but the supposed semantics in this seem too vague.

Now that there is much more backing across Uralic available, even the old comparison by Setälä of pujo with Samoyedic *pujå ‘nose’ could be rehabilitated, now on the root level: the latter seems to be analyzable as continuing something like *pujə-ja or *pujə-la, roughly ‘pointed thing’ (a relatively typical origin for terms for ‘nose’). The exact details of morphology will have to remain a bit up in the air for now though… normally *-ja derives agent nouns, *-la local nouns.

Some phonological considerations on the development of *-Vjə rimes in Samoyedic will be also required, since the divergence between *puj ‘eye of needle, etc.’ and *puə ‘back part’ should be explained somehow. For now I will leave this to a few notes. On one hand, actually many reflexes such as Tundra Nenets /pū/, Mator hu-na- could still derive from either proto-form. It’s maybe conceivable that the two PSmy reconstructions could be just positional variants of each other; e.g. *puj as a self-standing noun vs. *puə- before further suffixes? But on the other hand, at least Nganasan /hüj/ ‘eye of needle, back end’ vs. /huə/ ‘back part’ are difficult to treat in this way. For now it seems more feasible to suggest that, despite looking like an underived basic root noun, the semantically derived *puj actually also goes back to some kind of a derived pre-form; options that would work without too many new assumptions could include e.g. *pujə-ka (akin to Khanty), *pujə-k, *pujə-j.

[1] The longest-known example is probably Kh. *ɭökəmə- ‘to push’ ~ Hu. lök-, Fi. lykkää-.
[2] Published in the workshop proceedings collection Budapesti Uráli Műhely II. This is based on (and covers some, though not all, of the same ground as) the unpublished presentation “Basic Vocabulary in PU and PFU: Remarks to Etymology and Reconstruction” that I’ve seen cited in a few places.

Posted in Etymology

Two steps towards re-rooting Ludian phonology

Historical/comparative phonology of the Finnic languages reached remarkably thorough coverage already in the mid-20th century. Nearly all major varieties and numerous smaller dialect groups (particularly but not only of Finnish) have had their specific history covered by at least a large article-sized special study, often indeed a monograph. Where there remains more to do, the issue is mostly one of patches such as working out relative chronologies, pathways and areal patterns of change, or proto-forms of specific items.

There is however one case where a full rewrite would be warranted: Ludian, treated for ages as a “mixed Karelian–Veps variety”, but recently finally argued in detail by Miikul Pahomov (2017): Lyydiläiskysymys to be in essence instead a more conservative sibling of Veps. Or more strictly speaking, a cluster of such dialects: there seem to be no exclusively Ludian innovations that could be used to define this as a single language to the exclusion of Veps! The definition of “Ludian” has always been by a specific combination of retentions. E.g. going by the most immediately obvious phonological traits, Ludian retains Proto-Finnic *b *d *g as such (per older views: fortites *β *ð *ɣ to stops) (shared with Veps), but also retains long close vowels (shared with Northern Veps and all the rest of Finnic) and diphthongizes rather than shortens long non-close vowels (ie üö uo < *ee *öö *oo shared widely, ua < *ää *aa shared with Karelian and Eastern Finnish). Diphthongization per se remains an innovation here, but it’s too trivially areal to be worth anything for subgrouping. In fact a few sub-dialects even retain ää aa (so do again a few dialects of Karelian and Eastern Finnish), and one of the more poorly documented varieties appears to shorten them, as Veps does. Worth mentioning is also that even the speakers of what is usually called “Northern Veps”, and some of central Veps, in fact call themselves Ludians rather than Vepsians.

So the old two-part monograph on the historical phonology of Ludian by Aimo Turunen — Lyydiläismurteiden äännehistoria I (1946) on consonantism, II (1950) on vocalism — seems to now need an almost complete recontextualization. Perhaps E. Tunkelo’s Vepsän kielen äännehistoria (1946) could use some related updates too. What Pahomov’s work shows is that the Ludian and Veps varieties should be analyzed on one hand together, not separately; and that we should probably attempt a reconstruction of their last common ancestor as well. I will follow Pahomov in using the name “Old Ludian” for this.

A reconstruction of Old Ludian would probably be particularly interesting from a lexical point of view: e.g. how many Germanic loanwords have definitely made it this far by direct inheritance and cannot be treated as mediated by Karelian later on? How many exclusive shared Slavic loans are found? What unique derivatives or semantic shifts are there around? Questions of this sort will be somewhat hard to answer in detail as long as there is no dialect dictionary of Veps though. For Ludian there exists a sizable dialect dictionary Lyydiläismurteiden sanakirja (Juho Kujola, 1944), but on a closer look it is actually heavy only on the northern and central varieties that earlier research calls “Ludian proper” (varsinaislyydi), versus fairly light in the coverage of the southern, more transitional-to-Veps varieties that feature strongly in Pahomov’s argumentation. (He lists several example lexical isoglosses around pp. 163–166 though, but without clearly distinguishing innovations from retentions.) We can hope there to eventually be a dictionary of at least Kuujärv Ludian, the southernmost and today the most viable variety. Still at mere triple digits of speakers though, but luckily including well-educated language activists like Pahomov.

But I think the new perspective on Ludian would likely force a few phonological reanalyses as well, especially if also keeping an eye back all the way to Proto-Finnic. I cover in the rest of this post two candidates.

1. Final vowels

Apocope is one basic feature that demonstrates well the heterogeneity of Ludian. Generally, all original word-final vowels are lost in northern Ludian, as also in all of Veps; in central and southern Ludian, non-open vowels are partly preserved, while final *a *ä are reduced and surface either as /ə/ (the probable intermediate stage before loss) or as /u/ ~ /ü/ (as in Olonetsian; this has been explained as a fortition from *ə due to the influence of an Old Karelian superstrate).

Original preconsonantal vowels are however uniformly preserved, including vowels followed by Proto-Finnic final *k. Yet, *-k has itself been lost everywhere in eastern Finnic. The fact that a contrast regardless remains in northern Ludian would at first look seem to demand reconstructing preserved *-k for Old Ludian. A few near-minimal pairs from the Sununsuu dialect:

  • *-ak > -a: PF *polkëdak > poɫgeda ‘to tread’
  • *-a > ∅: PF *valkëda > vaɫged ‘light, lit’
  • *-äk > -ä: PF *imedäk > imedä ‘to suck’
  • *-ä > ∅: PF *pimedä > pimed ‘dark’
  • *-Ek > -e: PF *lähtek > lähte ‘spring’
  • *-i > ∅: PF *tähti > ťiähť ‘star’
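Under the preserved *-k reading, the pairs above amount to a relative chronology: apocope applies first, and *-k is lost only afterwards. A minimal Python sketch of that ordering follows. The function names and the simplified vowel inventory are illustrative assumptions, and only the word-final segment is modeled, so the outputs are not the attested Sununsuu forms (which show further changes such as *lk > ɫg).

```python
import unicodedata


def nfc(form: str) -> str:
    """Normalize to composed characters so that e.g. 'ä' is one symbol."""
    return unicodedata.normalize("NFC", form)


# Simplified vowel inventory (an assumption made for this sketch only).
VOWELS = set(nfc("aäeëioöuü"))


def apocope(form: str) -> str:
    """Delete a word-final vowel; vowels protected by a consonant survive."""
    form = nfc(form)
    return form[:-1] if form and form[-1] in VOWELS else form


def k_loss(form: str) -> str:
    """Delete word-final *-k."""
    form = nfc(form)
    return form[:-1] if form.endswith("k") else form


def derive(form: str) -> str:
    """Hypothetical ordering: apocope first, loss of *-k second."""
    return k_loss(apocope(form))


if __name__ == "__main__":
    for pf in ["polkëdak", "valkëda", "imedäk", "pimedä", "lähtek", "tähti"]:
        print(f"*{pf} > {derive(pf)}")
```

Applying the two rules in the opposite order, `apocope(k_loss(form))`, would delete the final vowel in both members of each pair, which is exactly the merger that northern Ludian does not show.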

But this is not the only option, and does not actually seem like the most parsimonious approach.

I would suggest that rather than forms like *polkedak, *lähtek, a good starting point would be a contrast between lax and tense vowels: *polkedà, *lähtè versus *valgedă, *täähtĭ. This finds some degree of evidence already from the central and southern varieties of Ludian, where some records do show reduced final vowels in bisyllabic stems. E.g. Kortaš akkᴇ̆ ‘married woman’, Nuomoiľ & Teru Priäžä ehtᴇ ‘evening’, Viidan & Kuujärv buťkɪ ‘umbellifer plant’, with devoiced ɪ (= IPA [e̥ i̥]), and explicitly marked as short in the first case; but e.g. Kš TP lähte, V Kj lähtə with no devoicing.

An IMO still more convincing parallel is provided by Natalia Kuznetsova’s recent research on Ingrian, most prominently in “Evolution of the Non-Initial Vocalic Length Contrast across the Finnic Varieties of Ingria and Adjacent Areas” (2016, Linguistica Uralica 52(1)), where she demonstrates that it is exactly through this path that some Ingrian dialects end up with the apocope of some but not all final vowels, and where reduced vowels can be observed to be devoiced next to voiceless consonants.

Devoiced reduced vowels have been reported from a third Finnic area showing apocope, too: Southwestern Finnish, cf. Ojansuu (1901: 24–25, 195–197). He however suggests a different mechanism for explaining the retention of final vowels in words that had PF *-k (and also *-h) — presence of a closed syllable “in some cases”. I presume this means mainly sandhi, so e.g. *läktek_meccässä > *lähðem_meθθäsä̆ > lähre mettäs [-mm-] ‘a spring in the woods’. This seems generally possible too, and may be to some degree complementary with the reconstruction of reduced vowels, but this would require an additional general survey of sandhi effects in the involved Finnic varieties.

I would propose reduced vowels can be furthermore connected with the fact that even standard Finnish shows allophonic V2 length difference between words of the shape CVCV and CVXCV. This allophonic pattern has surely been also present already in Proto-Finnic, explaining why apocope with an identical counterintuitive restriction to CVXCV wordforms arises in all of Estonian, Southwestern Finnish, Ludian–Veps (ancient in these three), at least one variety of Tver Karelian (more recently), and in some dialects of Ingrian (no earlier than 19th century). Of course some of these may be areally connected, to each other or to common contact languages, but if given a “preadaptation” within the common prosodic system inherited from Proto-Finnic, they do not all need to be.

2. Consonant-stem infinitives

Most of Ludian differs from Veps in not undergoing much syncope at all. Pahomov suggests a few exceptions however, including the infinitives of some d-stem verbs: anda- : ant(t)a ‘to give’, kanda- : kant(t)a ‘to carry’, ruada- : ruat(t)a ‘to toil’; though also andada, kandada occur in central Ludian.

This is syncope alright; but it does not seem to be a specifically Ludian phenomenon. As is known, though perhaps not widely, forms such as *antadak must have had already in Proto-Finnic syncopated byforms such as *attak. [1] This is shown e.g. by Old Finnish infinitives such as lentä- : letä ‘to fly’ < PF *lettäk < *lent-täk < *lentä-täk, lähte- : lätä ‘to leave’ < PF *lättäk < *läkt-täk, tietä- : tietä ‘to know’ < PF *tiettäk < *tietä-täk, not derivable within the specific phonological development of Finnish. In particular the simplification of *ntt to *tt (as seen in the first) is clearly no longer productive. The restriction of consonant-stem infinitives to A- and e-stem verbs but not i-, O-, U-stem verbs moreover clearly suggests that it has arisen earlier than the pan-Finnic development of unstressed *i, *o, *U from *əj, *Aw, *əw. While no **ata, **kata are attested in Finnish (only the regular vowel-stem antaa, kantaa) or anywhere else that I know of, this is likely to be simply due to these forms falling away to oblivion: after all modern Finnish today only knows one formation of this type, tuta ‘to know’, and then only as a fossilized relic in expressions, while the productive infinitive is exclusively tuntea.

Directly inherited infinitives of this type are actually widely found in Ludian. This is not a large surprise, since across the eastern Finnic area they have been reported sporadically from Olonetsian and productively from Veps already since Setälä. Besides ruat(t)a, perusing LMS turns up among bisyllabic d-stem verb roots also at least the following:

  • kieldä- : kielt(t)ä ‘to deny’ (< PF *keelt-täk)
  • kiändä- : kiät(t)ä ‘to twist, turn (tr.)’ (< PF *käättäk < *käänt-täk)
  • kuada- : kuat(t)a ‘to pour’ (< PF *kaat-tak)
  • lendä- : let(t)ä ‘to fly’ (< PF *lettäk < *lent-täk, cf. above)
  • löudä- : löut(t)ä ‘to find’ (< PF *leüt-täk)
  • (? nouda-) : nouta ‘to follow’ (< PF *nout-tak)
  • püuda- : püut(t)a ~ püudada ‘to hunt’ (< PF *püüt-täk)
  • siädä- : siät(t)ä ‘to do’ (< PF *säät-täk)
  • sorda- : sort(t)a ‘to fell’ (< PF *sort-tak)
  • souda- : sout(t)a ‘to row’ (< PF *sout-tak)
  • tiedä- : tiet(t)ä ‘to know’ (< PF *teet-täk)
  • tunde- : tut(t)a ‘to know, feel’ (< PF *tunt-tak)
  • vierdä- : viert(t)ä ‘to burn extra wood at a slash-and-burn field’ (< PF *veert-täk)
  • viändä- : viät(t)ä- ‘to dance; to bend’ (< PF *väättäk < *väänt-täk)
  • uurda- : uurtta- ‘to carve’ (< PF *uurt-tak)
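The pre-forms cited for these infinitives follow a mechanical recipe: drop the stem-final vowel of a tA- or te-stem, add *-tAk by vowel harmony, and simplify *ntt to *tt. A rough Python sketch of that recipe is given here. The function name and the ASCII-friendly input spellings are assumptions made for illustration, and treating *ktt > *tt as a general rule is an extrapolation from the single *läkt-täk > *lättäk example.

```python
import re
import unicodedata


def nfc(form: str) -> str:
    """Normalize to composed characters so that e.g. 'ä' is one symbol."""
    return unicodedata.normalize("NFC", form)


FRONT_VOWELS = set(nfc("äöü"))   # trigger the front-harmonic suffix *-täk
STEM_VOWELS = set(nfc("aäe"))    # only A- and e-stems form this infinitive


def consonant_stem_infinitive(vowel_stem: str):
    """Return a Proto-Finnic-style consonant-stem infinitive, or None.

    Hypothetical helper: 'lentä' is read as the vowel stem *lentä-.
    """
    stem = nfc(vowel_stem)
    # Restriction stated in the text: t-final consonant stem, A- or e-stem verb.
    if len(stem) < 2 or stem[-1] not in STEM_VOWELS or stem[-2] != "t":
        return None
    suffix = nfc("täk") if FRONT_VOWELS & set(stem) else "tak"
    joined = stem[:-1] + suffix          # *lent- + *-täk
    # Cluster simplification: *ntt > *tt (and, by assumption, *ktt > *tt).
    return re.sub(r"[nk]tt", "tt", joined)


if __name__ == "__main__":
    for stem in ["lentä", "läkte", "kääntä", "kaata", "tunte", "anta", "kanta"]:
        print(f"*{stem}- : *{consonant_stem_infinitive(stem)}")
```

Stems in other vowels (e.g. a hypothetical input `"sano"`) return `None`, reflecting the restriction to A- and e-stem verbs noted above.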

In this light, also ant(t)a and kant(t)a are unlikely to represent irregular late syncope from andada, kandada: they should be instead considered analogical reshapings of inherited *atà, *katà, with simple reintroduction of -n- from the vowel stem. The same analogical reintroduction of -n- is found also in ‘to plow’ (kündä- : künt(t)ä ~ kündädä), ‘to push, send’ (tüöndä- : tüönt(t)ä); and without gemination, a similar analogical reintroduction of -h- is found in ‘to leave’ (lähte- : lähtä ~ lähtedä).

In tA-stem verbs though (i.e. in those original *tA-stem verbs where the preceding consonant was voiceless, preventing voicing), only vowel-stem infinitives seem to occur: (-ht-) ahtada ‘to set up’, kiehtädä ‘to bother’, puahtada ‘to roast’, puohtada ‘to clean grain’; (-st-) kastada ‘to dip’, kestädä ‘to stand’, nostada ‘to lift’, püštädä ‘to stick’; (-tt-) ďiättädä ‘to leave’, keittädä ‘to cook’, ottada ‘to take’, suattada ‘to accompany, transport’; (-ɫtt-) poɫttada ‘to burn’. [2] This probably means that the origin of OFi. lätä, Lud. lähtä is to be dated to an even earlier stratum than the Proto-Finnic reduction of *A in the context *t_t. As is quite probable: Proto-Uralic *läkt(ə)- is also e.g. the only consonant-stem verb in Mari that ends in a cluster of two voiceless consonants, and the only consonant-stem verb known in Udmurt at all.

Lastly a second regular vowel-stem infinitive group consists of iďädä ‘to germinate’, pidädä ‘to keep’, vedädä ‘to pull, draw’: reduction and loss of unstressed *-A- following a light stressed syllable is not expected/precedented in any morphological context at all.


Neither of these examples really ends up changing much about the reconstruction of Proto-Finnic itself. The first is on the phonological level indeed simply a “patch” for the development path of apocope, though it is also one piece of evidence for the reconstruction of allophonic vowel length. The second seems to provide the first attestations pointing even indirectly to the infinitive forms *attak, *kattak, though examples like letä, tuta also already allow hypothesizing such forms anyway. But the take-home seems to be that while the segmental phonology and morphology of PF are well-known by themselves, two areas suitable for further work would be prosody and morphophonology. Both of these incidentally also become much less charted territories when looking further back towards Proto-Uralic.

[1] At minimum; it is conceivable that the vowel-stem infinitives are altogether later analogies.
[2] Old Finnish too seems to show no examples of consonant stems for verbs in -htA- or -sta-. The few examples for verbs in -ttA- could be themselves analogical, sometimes even misinterpretations.

Posted in Reconstruction

Phonology squib: raate

The standard Finnish word for the buckbean (Menyanthes trifoliata) is raate. This word often appears in overviews of Finnish historical phonology as a supposed example of irregular development of early Finnish *ð. Sure enough, dialect forms like Satakunta rarake, Tavastian ralake definitely point to *raðakeK (where *K ∈ {*k, *h}), while transitional and eastern dialects’ raate ~ roate ~ ruate would be regular from *raðateK. Same goes for Karelian roateh, which appears to identify the word-final consonant as *h. Northern Ostrobothnia also shows a “bridging” form raake, seemingly from *raðakeh.

However, there are a few general problems with this:

  • for single *-ð-, more often it is western forms with -r- or -l- that spread beyond their expected borders, not eastern forms with loss; [1]
  • there seems to be no substantial eastern dialect evidence for the form *raðakeh;
  • the variation between *-keh and *-teh remains unexplained.

I propose that the forms with ⁽*⁾aa actually do not result from loss of *ð, irregularly in the west; they result from an early POA dissimilation of *raðak̆keh to *raɣak̆keh. This, then, would’ve set off a further “suffix dissimilation” to *raɣat̆teh in Eastern Finnish ~ Karelian (and we now require also distinguishing *-P̆P- from *-P-, given their distinct reflexes in southern Karelian).

As long as the origin of this word remains unknown beyond “northwestern Finnic” (Finnish–Karelian [2]), in principle *raɣakeh could even be the oldest form, with *raðakeh being due to regressive dissimilation from *ɣ-k, rather than progressive dissimilation of *r-ð. This combination is indeed otherwise generally tolerated: rata : radan ‘track’, retu : redun ‘dirt’, riita : riidan ‘quarrel’, rita : ridan ‘trap’, rotu : rodun ‘race’, ruoto : ruodon ‘fishbone’ do not show any similar irregularities. Of course though, in cases like these strong grade -t- would have provided analogical support; in principle we could assume even regular dissimilation once upon a time, with all the evidence other than raate then analogically reverted. [3] It also seems quite likely to me that the dissolution of western Finnish has begun in the southwest, and that therefore we should not expect to find major common innovations across the area … which we indeed don’t, aside from the general areal features that are normally used to define the Southwestern dialect zone (lack of Tavastian *ð > l and *CɣV > CVV, heavy syncope and apocope) and some commonalities that are analyzable as shared retentions from an older era (e.g. plural genitive in *-den, *-ten rather than *-iden; *suvi ‘summer’ vs. *kesä ‘fallow’; numerous vocabulary items shared with Estonian).

[1] On the other hand, with *-hð- there are a number of examples tending towards loss even in the west; known examples appearing in standard Finnish include ehättää ~ dial. ehrättää ‘to reach in time’ < *ehðättää ← ehtiä ‘to be on time’; lähettää ~ dial. lährettää ‘to send’ < *lähðettää ← lähteä ‘to depart’; kohentaa ~ dial. kohrentaa ‘to improve, to fix an object’s placement’ < *kohðentaa < kohta ‘place’; and the derivative suffix -auttaa ~ -ahduttaa -ahtaa.
[2] IMO a more likely grouping than a primary division into Western Finnish vs. Eastern Finnic including Ludian–Veps. Features that Karelian shares exclusively with Ludian or Veps can be mostly attributed either to late Russian influence, or to contact with Old Veps (maybe better called Old Ludian) and the later Ludian varieties. In turn, Eastern Finnish and Ingrian, whose close affinity with Karelian is very obvious, have almost nothing shared with Ludian–Veps that would require setting up an Eastern Finnic proto-language. All the “Karelid” varieties however show several features that are absent entirely from “Far Eastern Finnic”, yet shared with western Finnish — phonemic consonant gradation being maybe the most conspicuous feature.
[3] In principle reuhka ‘poor hat’ could be maybe derived as *reðu-hka > *reɣuhka; but this would be much too inexact semantically compared to the perfectly fine loan etymology from Russian треуг.

Posted in Etymology, Reconstruction

On Out of Eurasia and linguistic time depth

So here’s the hypothetical (as developed previously). Suppose modern humans have been hanging out at least somewhere around Eurasia already for 100, perhaps 200, maybe as much as 300 millennia, instead of merely 50–70. Should any of our views on the history of language(s) be affected?

A basic immediate result is that this substantially increases the time depth available for the language families around. This includes not just the known and proposed ones, but also the undetected ones that anthropology and genetics tell us will have to exist. As established before, an Out of Africa theory of modern human origins demands that ~all languages outside Africa ought to go back to a single common ancestor about 70,000 years ago, since the Near East creates a natural bottleneck for early, pre-naval migrations. An alternative Out of Eurasia however does no such thing. It does suggest the existence of a few linguistically unconfirmed macrofamilies like African, Amerind or Australian, but these do not need to go back to any especially closely related Eurasian ancestors. These in turn do not have to be especially closely related to any modern Eurasian language families either.

If so, a failure to detect any relationship between even geographically relatively close-by families such as Sino-Tibetan and Indo-European, or Semitic/Afrasian and Sumerian, does not have to mean that the Comparative Method is therefore likely to run out of steam after 10 or even 20 millennia. Maybe the effective limit is much deeper, but it is also the case that these kinds of “patently unrelated” languages really have been separate from one another ever since the Lower Paleolithic.

Eurasia houses also the great majority of the world’s well-studied language families. (In this context I would count also Austronesian as a “Eurasian family”, given its homeland in Taiwan, or maybe adjacent continental China. [4]) Those elsewhere have been documented and reconstructed on average much more scantily. A few blazing successes such as Bantu or Algonquian also suggest that there are many more results left to be claimed. It is therefore only in Eurasia that we can really with decent confidence claim that plenty of the language families appear to be unrelated or only vaguely related. Macrofamilies elsewhere in the world, such as Amerind or Australian, cannot be decreed invalid just on the basis of the so far poor results for macrofamilies across Eurasia!

It is every so often also claimed that linguists working with African languages in particular would tend more towards “lumping”, while linguists working with Eurasian languages would tend more towards “splitting”. I don’t think this is fair, except perhaps if we define “lumping” and “splitting” purely by the size of the language families involved. If African macrofamilies appear to have about as much evidence for them as Eurasian macrofamilies do now, when the languages of Africa are far less documented and researched, then I think we can expect the evidence base to keep growing. Over the 21st century I expect further solid reconstructions and new perimeters to be reached.

On a historiographical note, I would also like to briefly point out that while “lumping” is often blamed on Joseph Greenberg and his alleged uncritical followers, almost all of his macrofamilies had been proposed or at least explored already earlier on. He may have brought in some new annexations and new unwarranted confidence, but the concept of “macrofamilies” has in principle nothing to do with Greenberg’s barely-method of mass comparison. [5]

I expect also one cross-linguistically important result to eventually emerge from ongoing research on the major African language families in particular. Besides being big, these are likely to be fairly old… Hence, if we have one day a detailed reconstruction of Proto-Niger-Congo or Proto-Macro-Sudanic (“Nilo-Saharan” at its widest I do think is a wastebasket taxon), and their development to a few fairly distant languages, this will be able to show us what a linguistic relationship that’s 10,000+ years of age really looks like! I am not confident that this would have to turn out as minimal as is often claimed.


It is an unfortunately common notion that linguistic relationships of 10,000+ years of age would have to be undetectable. This seems to come from two main sources, both of them IMO fallacious. The first is naive extrapolation from examining relationships maybe half or a third as old. Proto-languages of this age we can reconstruct, but only partially. E.g. in terms of lexicon: in any language that is still spoken, tens of thousands of words can be attested, while in multi-millennia-old proto-languages only some hundreds can be reconstructed. So should we not assume that a further two, three or four time periods equally long should squeeze the available evidence down to definitely nothing? Well, maybe, if you’re willing to bite the two bullets that (1) glottochronology works at least on long enough time periods, and (2) there is no core lexicon or grammar, and everything is equally likely to change. Without the first, punctuated equilibrium models will allow for the possibility that some languages may have remained, at times, mostly stable for millennia. Without the second, you will have to admit that languages’ core features actually remain stable much longer than their peripheral features, [6] and they will likely allow reconstruction efforts well beyond what naive extrapolation suggests.
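To make the difference between the two pictures concrete, here is a back-of-the-envelope sketch in Python. Every number in it is a made-up assumption chosen only for illustration (20,000 attested words, 500 reconstructible items after one period, a hypothetical stable core of 300 items retaining 80% per period); none of them come from any actual dataset. The function names `uniform` and `two_tier` are likewise just labels for this sketch.

```python
def uniform(n0, retention, periods):
    """Every item decays at the same rate (the 'naive extrapolation')."""
    return n0 * retention ** periods

def two_tier(core, core_ret, periphery, periph_ret, periods):
    """A stable core and a fast-decaying periphery, decaying separately."""
    return core * core_ret ** periods + periphery * periph_ret ** periods

# Assumed, purely illustrative figures.
N0 = 20_000            # words attested in a living language
AFTER_ONE = 500        # items reconstructible one period back
CORE, CORE_RET = 300, 0.8
PERIPHERY = N0 - CORE
# Periphery retention chosen so both models agree after one period.
PERIPH_RET = (AFTER_ONE - CORE * CORE_RET) / PERIPHERY

for t in range(1, 5):
    naive = uniform(N0, AFTER_ONE / N0, t)
    tiered = two_tier(CORE, CORE_RET, PERIPHERY, PERIPH_RET, t)
    print(t, round(naive, 2), round(tiered, 2))
```

Both models are calibrated to give the same 500 items after one period, yet after four periods the uniform model is left with less than a single item while the two-tier one still keeps over a hundred. The point is only qualitative: what we observe at one time depth does not by itself tell us which curve we are on.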

The second error is based on the essentially winged age estimates for Eurasian macrofamilies. This is already internally incoherent, though. If a maybe-family like Nostratic is proposed to be in the age range of 10,000 years, and if the Nostratic proposal is too weak to be accepted… then it does not follow that all language families in this age range will be too weak to be accepted: rather, this means that the age proposal for Nostratic is, together with the family itself, also too weak to be accepted. What does not exist, cannot be dated either. (Did Cthulhu invent cephalopody before or after the molluscs did?)

One possible compromise for this would be to treat (again, e.g.) Nostratic as a language family that can be accepted through archeological / cultural-anthropological / genetic arguments, but not linguistic ones. I don’t think anyone really believes in this exactly, though. I only ever see arguments to the effect that, if Nostratic exists, based on cultural & genetic distance, then it will have to be at least 10,000 years old (and this much is easy to agree with). But this is only a lower bound! It gives us no evidence about an upper bound. Since language shift and cultural convergence exist, there is no linear or even monotonic relationship between linguistic distance and cultural-genetic distance. After a few language shifts and centuries of convergence, the modern citizens of Haparanda and Tornio are just about indistinguishable by cultural and genetic distance. Yet by linguistic distance, one side still remains Indo-European, the other Uralic.


Finally, most of the various ideas above can be pooled together as a provocative new hypothesis for thinking about the Eurasian “maybe-families” and their place in the general context of “macro-comparison”: the vague resemblances we see in e.g. Nostratic could be indicative of what language relationship after 100,000 years looks like. That is, it might turn out that the reason macro-comparison has been thought to be mostly fruitless is that, the entire time, most linguists have been actually beating their heads against the hardest such problems around! Once we have Indo-European-level (Uralic-level? [7]) documentation of most languages involved, maybe not just units like Atlantic-Congo or Trans-New-Guinea, currently established on fairly meagre evidence, but also even much older and larger units, say Niger-Saharan or Southern Amerind, will turn out to be relatively feasible to establish and reconstruct. Only future work will tell for sure. [8]

(To be continued still…)

[4] As suggested by Blench in various draft papers. Also the today quite well-emerging Austro-Tai theory would fit in with this: given probable continental relatives as well, Taiwan may simply constitute a residual zone populated by Austronesian groups driven off of or extirpated on the mainland by the Chinese.
[5] Even mass comparison can be probably steelmanned, but that’d be a topic for another time.
[6] Cf. the fact that radioactive isotopes’ half-lives come as anything between billions of years and some fractions of picoseconds.
[7] It seems possible that Uralic is actually the best-documented large language family out there, if “large” is defined to cover both diversity and time depth. There’s a grammar for everything (even if from the 1800s for some languages), and at least one extensive multidialectal dictionary for almost everything (usually more; oldest unsuperseded ones are from around 1940; biggest omission is probably Veps); multiple etymological dictionaries and historical grammars for every language with a longer written tradition (well, all three of them), and a bunch of them even for minority languages. Indo-European definitely sports the best-documented individual languages out there, but the family-wide average is killed hard by modern Indo-Iranian languages, which AFAICT have been essentially deemed “comparatively useless” to document due to Sanskrit and Avestan being available. Semitic fares somewhat similarly (weak points: Ethiosemitic, Modern South Arabian). On comprehensiveness + diversity, several other (once again usually Eurasian) families like Japonic and Turkic are doing well or getting there, but all of these are younger. On comprehensiveness + time depth the only contender I can think of is perhaps Kartvelian, which does poorly on internal diversity.
[8] Provided that people will not write off even the prospect of such future work by unsubstantiated assertions about the Comparative Method “stopping working” after some random number of millennia. But, I also think it doesn’t really matter if this belief is being held by some people currently… It will be a good idea anyway to get a decent reconstruction of Benue-Congo before trying to reconstruct Atlantic-Congo or the whole Niger-Congo, and also to get a decent reconstruction of Bantoid or Edoid before trying to reconstruct Benue-Congo. And if I am right about families like NC being able to eventually provide us with explicit examples of languages being demonstrably related over 10,000+ years, then the skeptics are not going to be denying any of the results happening along the way; they’re going to be gradually retracting their supposed time limit, until it turns out to be deep enough that it can be no longer used to flatly deny other similarly far-away but less obvious results like Dene-Yeniseian either.

Posted in Methodology

Excursion: On Out of Africa

Out of Africa (OOA) has been the main theory of the origin of modern humans since the mid-20th century. Strictly speaking this is only a theory of anthropology. Since language is a human phenomenon, [1] it has however also sprouted a “linguistic Out of Africa” theory alongside.

According to what could be called “the evolutionary theory of language”, we observe that new languages only come about by the spreading and splintering of earlier languages. (Or, perhaps a better biological analogue still is the third tenet of cell theory: that cells only come about from other cells.) This alone already suffices to imply that there exists a family tree of languages, tracing back to the ancient era of glottogenesis. Connected with a relatively late (“recent”) expansion of modern humans out of Africa, we can then in particular infer the highly likely existence of a language that could be called “Proto-Exo-African” (PEA) — the language of the humans who first set on this exodus, which must also have been a common ancestor of all languages spoken in Eurasia, Oceania and the Americas. This is an idea that is in principle sound, even if, in my impression, underappreciated among historical linguists. The smaller number of so-called evolutionary linguists out there do understand it well, at least.

This argument though says nothing about whether this common descent would be in any way identifiable from the linguistic data itself. Language does not have a strict analogue of DNA (or any other similarly transferred major biochemical machinery), and is not strictly speaking “transmitted” as much as “constructed” over and over again every generation. No child is born knowing a language, only with the ability to acquire a language. This could add up to the result that every linguistic feature of PEA has been by now either lost or diluted to undetectability. And it happens to be the case that all language families with general approval so far are still at least an order of magnitude younger than the assumed recent OOA spill-out starting some 70,000 years ago. Even the more ambitious proposals like Amerind or Nostratic (that actually have some legitimate comparative evidence backing them, unlike attempts to scrape together things like Proto-World) are only proposed to reach at most some 20–30 millennia of age, i.e. barely a third of the way back.


If PEA might be undetectable by direct means, how error-proof is the indirect demonstration of its existence then? As long as we do not question the underlying OOA theory, there are really only four possibilities under which the assumption of PEA might fail:

  1. humans leaving Africa did not yet have language, and it has come about only later, possibly several times independently;
  2. at some point in prehistory, new languages were created from scratch to replace earlier natural languages, and some or all modern languages rather descend from these “new” languages;
  3. the African exodus population spoke more than one language;
  4. at some point in prehistory, other (possibly entirely unrelated) language families have secondarily spread out of Africa, to replace some or all descendants of PEA.

Of these, #1 is difficult to directly refute. Spoken language does not fossilize, and hence the study of the biological evolution of language is to a large extent an issue of speculation. The most common opinion around however is that language would have existed at least by the transition to anatomically modern humans (AMHs), as distinct from Neanderthals and the newly found Denisovans, so at least a couple hundred millennia ago. In this post series I will continue to follow this assumption as well.

All the others, however, would not make major dents in the hypothesis of monogenesis of non-African languages.

#2 is a priori improbable, and hence not actually a major objection. If we take seriously the rarity of languages being freshly invented (i.e. stick to the principle of uniformitarianism), then even recent glottogenesis events will actually only leave a slightly weakened OOA theory of language, one where we can allow for e.g. Basque or Turkic to have been created from scratch, but all other non-African languages can be still assumed to be descendants of PEA. Similarly #3 would only split the family tree into a small copse of unrelated-at-OOA-time trees, probably at most no more than 3-4. These would have good chances of being still related to one another at some pre-OOA date, so that there is a PEA in the last common ancestor sense, even if confusingly enough it was not itself exo-African (and could be therefore also the ancestor of various African languages).

#4 actually has good chances of being true: maybe the best contender for non-PEA languages spoken in Eurasia today are the Semitic languages in the Levant and Mesopotamia, grouped in the larger Afrasian family, whose homeland is often (though not always) placed somewhere in northeastern Africa. But we can also see that this too would only very slightly push back the boundary of African vs. exo-African languages. Languages spread only step by step. Perhaps some other lineages in the vicinity of Africa, say Sumerian or Dravidian, could be also of yet more recent African origin, but once modern humans had first colonized places like Siberia and Southeast Asia, new intrusions all the way from Africa are unlikely to happen.

Altogether even a buffer zone of maybes doesn’t seem to shake the conclusion that the 100+ exo-African language families known today to linguistics (including isolates) must be only a few top boughs of at most a handful of much larger underlying language families, dating back to the time of the OOA expansion.


But has there really been a recent Out-of-Africa expansion?

There definitely has been at least one OOA event, since also the Neanderthalians, Denisovans and Homo erectus are thought to descend from African hominins. For recent modern human OOA in particular though, the main line of hard evidence has originally rather come from the fossil record. 200k-year-old Homo sapiens remains from Omo Kibish in Ethiopia, the oldest known throughout the late 20th century, have been some of the best supporting evidence. This is however but a single archeological datapoint, and one that has even been overturned. Since last year, the oldest known remains of AMHs now come instead from Jebel Irhoud in Morocco, dated around 300k years old. Note that this is quite a gap, both chronologically and geographically! The data doesn’t get especially dense going forward either. Only a handful of any modern human remains under 100k years of age are still known. More importantly this slightly extended selection already includes locations also outside Africa, in modern Israel and Oman (discovered not too long ago as well). These have been suggested to represent “failed migrations that died out”, but this strikes me as special pleading. I doubt that anyone looking at this scattered early record without the weight of research history (and with understanding of the Signor–Lipps effect) would place the origin of anatomically modern humans within Africa with great confidence. At minimum a Near Eastern origin seems to be entirely possible as well. Paleoecology could be able to suggest other likely locations still.

Several posts by anthropology blogger Dienekes have moreover drawn my attention to a few interesting additional arguments to consider recent OOA on very shaky ground by now. For one, recall that modern humans’ closest known relatives are the Neanderthalians and the Denisovans — two Eurasian species, with Denisovans branching off first, which would suggest that the common ancestor of the three, and even the Neanderthal-AMH last common ancestor, lived in Eurasia as well. (This does not need to coincide with the LCA of crown group AMHs, however.)

For two, genetics has long pointed out that the modern human populations of sub-Saharan Africa [2] show altogether greater genetic diversity than those of Eurasia (+ with even further rarefaction in Oceania and America). However with the rapid development of archaeogenetics in the last few years, we have by now first lines of evidence that this could be due to admixture with archaic Homo sapiens groups.

Maybe the Neanderthal and Denisovan OOA event was then also the main modern human OOA event after all?

This would also imply at least two inverse “into Africa” expansions (one leading to archaic African substratal groups, the other for crown group AMHs). This does not seem to be a very costly assumption though, since the distribution of several archaic haplogroups already demands multiple major population movements across the continent. E.g. the archaic mtDNA haplogroup L0 of South Africa is both first-to-branch-off and present in only fractional proportion in the populations that do carry it, clearly requiring multiple admixture events along the way (instead of, say, a Great San Migration that starts 100k years ago followed by them hanging out in South Africa mostly intact after that). There are also haplogroups with primarily Eurasian distribution but some inroads even into sub-Saharan Africa, requiring their own more recent but still prehistoric into Africa or out of Africa movements or gene flows (e.g. Y-haplogroup T, mtDNA haplogroup U).

An option to be also kept in mind is haplogroup extirpation: various today-African haplogroups may have once existed in Eurasia too, but eventually died out, e.g. under later population movements. The same could be the case anywhere of course, but to me it seems that Eurasia is a priori the most likely location for this. For one already due to size, for two due to the historically well-documented extensive population movements, particularly across and around the major “crossroads” that is the Near East. [3] A third and maybe the most powerful candidate for wiping out genetic diversity from Eurasia would be the latest glaciation period. Still, the human Y-chromosomal and mtDNA haplogroup trees both start off with a remarkably large series of exclusively African early groups. Any theory of AMH origins outside of Africa would have to explain most of these through archaic admixture, with haplogroup extirpation probably only granting some wiggle room around those points in the two family trees where the branches start to turn Eurasian-centric instead.


I’m aware I’m outside my zone of expertise here, so in case this sounds like I am suddenly a few moomins short of a valley, I do want to note that overturning recent OOA is still not looking cut and dried exactly (and if any of the arguments above have big glaring holes in them, I would appreciate readers pointing them out). If you bear with me for now, though, I will develop in the next post some potentially very interesting corollaries this possibility would have.

[1] It may be at times useful to think of linguistics, like most other humanities subjects, as a subdiscipline of anthropology (in a somewhat similar way as how biology could be considered a subdiscipline of chemistry).
[2] Important as a human genetic and cultural area, and to some extent also as a more general biogeographical region. North Africa by contrast on many marks aligns with Eurasia instead. I have wondered if sub-Saharan Africa would deserve its own underived term, along the lines of “Maghreb” for North Africa. There are a few historical candidates, but nothing that really stands out as immediately usable: Ethiopia and Sudan have been already claimed by states (much as also Libya), and Zanj also has been kind of claimed by Tanzania. The analogy of Australia could suggest e.g. “Meridionalia” or “Equatoria”, but outright coinages have quite a few degrees of freedom to them.
[3] Later also the “highway” that are the steppes, but my impression is that pseudo-periodic nomad invasions of Eastern Europe / Persia / China have only really been a thing after the domestication of the horse, and related inventions such as chariots, saddles, stirrups etc., all fairly recent in the big scheme of things.

Posted in Uncategorized

A Problem Statement for Uralic vocalism

As noted in my previous post, I have by now nailed down as my next professional milestone a hunt for previously unnoticed innovative features within the Finnic vowel system.

Besides individual surface questions about what the vowel system of Proto-Uralic may have looked like (harmony this, stem vowels that, long vowels yay or nay…), there is also a second, more methodological theme involved that may be less apparent. This is the question of how to reconstruct a proto-language when faced with extensively overlapping correspondences.

Uralic vowel reconstruction is not really constrained by data. The etymological pool has been sitting at around 1000 items already since the late 19th century. Etymology has kept progressing over the time, but almost as much of it has involved discarding old poor comparisons as adding new better ones, and hence with surprisingly little quantitative impact (an optimistic count would put us as having gone from about 900 to about 1200). Yet it still remains the case that effectively no two etyma fully agree in their correspondences! Even seemingly perfect rhyme series found in just about all languages show divergence in one or two languages. E.g. for PU *kala ‘fish’, *pala(-) ‘bit, to bite’ and *sala- ‘to steal’ it is Khanty and Selkup that diverge, both in different items even: *kuuL, *puuɭ, *ɬaaL-; *qwëlɨ, *poolɨ-, *twëlɨ-. [1] More often, numerous gaps in data prevent assigning correspondences definitely together as series: a given sparse correspondence set may be simultaneously compatible with three other sets, which however all disagree with one another. In other words we are saddled with too many correspondences to straightforwardly tackle. This all has already been noted before too, e.g. by Kaisa Häkkinen in 1983, in her PhD thesis Suomen kielen vanhimmasta sanastosta ja sen tutkimisesta, pp. 120–151.
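The “compatible with three other sets, which however all disagree” situation can be stated quite mechanically. Below is a minimal Python sketch of it. The languages L1 to L4 and all the reflexes are hypothetical toy data, not actual Uralic material, and `compatible` is just a name I am using here for the obvious check.

```python
from itertools import combinations

def compatible(a, b):
    """Two (possibly partial) correspondence sets are compatible if they
    agree in every language where both have a reflex attested."""
    shared = a.keys() & b.keys()
    return all(a[lang] == b[lang] for lang in shared)

# Hypothetical toy data: three fully attested correspondence sets...
full_sets = {
    "A": {"L1": "a", "L2": "o", "L3": "u", "L4": "a"},
    "B": {"L1": "a", "L2": "o", "L3": "o", "L4": "e"},
    "C": {"L1": "a", "L2": "o", "L3": "a", "L4": "i"},
}
# ...and a sparse one, with reflexes known from only two languages.
sparse = {"L1": "a", "L2": "o"}

# Which full sets could the sparse one belong to?
matches = [name for name, s in full_sets.items() if compatible(sparse, s)]
# Which full sets contradict each other?
conflicts = [(x, y) for x, y in combinations(full_sets, 2)
             if not compatible(full_sets[x], full_sets[y])]

print(matches)
print(conflicts)
```

Here the sparse set matches all of A, B and C, while every pair among A, B and C conflicts. Compatibility in this sense is not transitive, so sparse etyma cannot simply be merged into whatever series they happen to fit.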

My reading on the research history seems to moreover reveal that it’s tackling this issue that has been driving all the major debates on Uralic vowel reconstruction through the years. There have been roughly four approaches considered, all of them in principle admissible per known processes of language change:

1. Reconstruct different vowels for each correspondence (the “trivial reconstruction” approach). This was briefly attempted in the late 1890s, within the West Uralic (Samic–Finnic–Mordvinic) group. E. N. Setälä proposed at this time (see p. 839– here) reconstructions such as the following:

  • *ȧ > S *ā ~ F *a ~ Mo *a
  • *å > S *uo ~ F *a ~ Mo *a
  • *ɔ > S *oa ~ F *o ~ Mo *u
  • *o > S *uo ~ F *o ~ Mo *o, *u
  • *ɔ̄ > S *oa ~ F *oo ~ Mo *u
  • *ō > S *uo ~ F *oo ~ Mo *a

However, any attempt to extend this method wider out will turn out to require further and further splintering, and by the time we end up with a triple-digit number of different proto-vowels, this idea will be clearly untenable.
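The splintering can be illustrated already with the six rows above. The following Python sketch counts how many proto-vowels the “trivial” approach needs as more branches are taken into account. The data is copied from Setälä’s list as given here (with the double Mordvinic reflex of *o written "o/u"), while the helper `distinct_series` is only my own illustration.

```python
# (Samic, Finnic, Mordvinic) reflexes, keyed by Setälä's proto-vowel.
rows = {
    "*ȧ": ("ā", "a", "a"),
    "*å": ("uo", "a", "a"),
    "*ɔ": ("oa", "o", "u"),
    "*o": ("uo", "o", "o/u"),
    "*ɔ̄": ("oa", "oo", "u"),
    "*ō": ("uo", "oo", "a"),
}
SAMIC, FINNIC, MORDVINIC = 0, 1, 2

def distinct_series(rows, columns):
    """Number of distinct correspondence series over the chosen branches,
    i.e. the number of proto-vowels trivial reconstruction would posit."""
    return len({tuple(r[i] for i in columns) for r in rows.values()})

finnic_only = distinct_series(rows, [FINNIC])
finnic_mordvinic = distinct_series(rows, [FINNIC, MORDVINIC])
all_three = distinct_series(rows, [SAMIC, FINNIC, MORDVINIC])

print(finnic_only, finnic_mordvinic, all_three)
```

The count goes from 3 (Finnic alone) to 5 (with Mordvinic) to 6 (with Samic added), and adding a branch can never bring it down. With nine or so branches and many more divergent series per branch, it is easy to see how this gets out of hand.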

2. Assume original vowel alternations, with levelling in each descendant. This idea was also initiated by Setälä very shortly afterwards, indeed already explored in the same article I linked, and attained maybe its purest form in the work of T. Lehtisalo in the 1930s. [2] In his work e.g. what Setälä above reconstructs *ȧ becomes *ā; but *å is transformed into *ā ~ *ò, *ɔ into *ò ~ *ù, *o into *ò ~ *ō ~ *ū, *ɔ̄ into *ō ~ *ū, and *ō into *ō ~ *ā. Most of his various proto-vowels actually never exist outside such pseudo-ablaut patterns.

After WW2 the “locus” of this line of reconstruction moved from Finland to Germany, with W. Steinitz defending his own variant of the idea extensively in the 40s thru 60s. No real research on the topic has occurred since his death however. (Amazingly enough, it still lingers though in some overviews of Uralic penned by people who evidently ignore all research published outside of the German-speaking world.) I see this as unlikely to be effectively revived either: the only Uralic language showing somehow productive evidence for “ablaut” is Khanty, while everywhere else alleged evidence for vowel alternation is either due to transparently secondary changes, or is really based on sound correspondences rather than language-internal evidence.

3. Assume sporadic vowel changes, per ad hoc influence of varying surrounding phonetic environments: anything adjacent to labials might be labialized or delabialized, anything adjacent to velars might be backed or labialized, anything adjacent to /r/ might be lowered or backed, etc. No one has treated this as the sole explanation for various vowel correspondences across Uralic, but this was considered a major mechanism first by E. Itkonen, whose work ended up repealing the Setälä school “gradation” model in Finland, yet ended up enshrining a very Finnocentric image of Uralic vocalism (cf. before).

Today this approach most strongly still persists in Hungary. One reason surely is that this is the model that has been adopted in the UEW, often treated as the crown jewel of Hungarian Uralistics, and whose tentative reconstructions are then sadly often treated as ex cathedra truth. I suspect a second reason is moreover found in language-internal history: the Modern Hungarian vowel system cannot be derived from that of Old Hungarian by regular sound changes — if taken at face value. However the very limited inventory of Old Hungarian vowel graphemes (in first sources just ‹a e i o u›, slightly later expanded to ‹a e i o u ü›, etc.) very likely hides unwritten distinctions. [3]

4. Attempt to reconstruct conditional vowel shifts. First explored already by A. Genetz contemporarily with Setälä, and by now universally adopted e.g. for much of the West Uralic data: Setälä’s *ɔ, *o turn out to be in complementary distribution with respect to stem type (*o-a versus *o-ə, split only in Samic). Recent wisdom shows this to be mainly the case for his *å versus *ō likewise (*a-a, *aCCə, *aTə versus *aRə, split only in Finnic). This then also explains extremely naturally the identical reflexes in Finnic and Mordvinic for the former, in Samic and Mordvinic for the latter: they don’t just coincide, they’ve always had the same vocalism.

More generally, this approach is adopted to some extent by everyone more recent than Lehtisalo (including Steinitz and Itkonen), but often only partially. I believe it can and should be still pushed further to reach new results.
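What “complementary distribution with respect to stem type” amounts to can be sketched as a small check in Python. The stems below are schematic placeholders (C standing for any consonant) rather than real etyma, and the function names are hypothetical. Only the pairing of Samic *oa with *o-a stems and *uo with *o-ə stems follows the description above.

```python
from collections import defaultdict

def reflexes_by_environment(items, env_of):
    """Group the attested reflexes by a candidate conditioning environment."""
    seen = defaultdict(set)
    for stem, reflex in items:
        seen[env_of(stem)].add(reflex)
    return dict(seen)

def is_complementary(items, env_of):
    """A split is predictable from the environment if every environment
    shows exactly one reflex."""
    return all(len(r) == 1
               for r in reflexes_by_environment(items, env_of).values())

# Schematic proto-stems paired with the Samic reflex of the first vowel.
items = [("koCa", "oa"), ("poCa", "oa"), ("koCə", "uo"), ("toCə", "uo")]

def stem_vowel(stem):
    return stem[-1]

def initial_consonant(stem):
    return stem[0]

print(is_complementary(items, stem_vowel))         # conditioned by stem type
print(is_complementary(items, initial_consonant))  # not by the initial
```

With the stem vowel as the environment the two reflexes never overlap, so one proto-vowel suffices. With the initial consonant as the environment, k- shows both reflexes and the check fails. Real data will of course be messier, and the interesting work is in finding which environment, if any, makes the check pass.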


It must be also noted that these are not methodologically equal approaches.

The first approach does make exact predictions, and is to an extent obligatory: we do need to assume some number of vowel phonemes in Proto-Uralic, and some unconditional/elsewhere reflexes for them in the daughter languages. [4] But the vast number of correspondences demands also some other mechanisms to account for the large number of non-core cases (not really “edge cases” when they may be the majority altogether). While a few of them could be in principle again accounted for by setting up new Proto-Uralic vowel phonemes, this method ends up as awfully arbitrary: we have no clear grounds to prioritize any single case of variance in reflexes as inherited from Proto-Uralic, while leaving other cases of variance to be explained by other methods. In fact I think by now that reconstructing any proto-language contrasts at all from only a single branch among several (i.e. at least in largely polytomous-looking dialect-continuum/linkage situations such as Uralic) is methodologically illegitimate — while such cases can obviously happen in principle, only when a contrast is continued by more than one line of evidence is it possible to securely privilege a particular reconstruction.

The second and third mechanisms however are poor patches to the problem: they end up as unfalsifiable “just-so phonology”. Both irregular sound change and paradigmatic levelling are singular events that can be only assumed, never defended in detail, and never clearly shown to be incorrect by additional data. Usually it also becomes nearly impossible to then establish the real proto-language starting point. For the former the main issue is one of directionality, especially for supposedly irregular correspondences widely across a family, but also since local archaisms are in principle possible. For the latter the typical problem has been treating “alternation” merely as a free-floating excuse to mix and match vowel reflexes, without giving it any original morphological or phonological distribution. Sometimes we may fall back to these, but they’re no more than band-aids for etymologies that otherwise seem to work and which we don’t feel like discarding for but a single irregular feature. (There are further similar mechanisms available to the historical linguist too, but they start to get outside phonology entirely. [5])

It is only the fourth approach that has real explanatory power for exception cases. Reliably established conditional sound changes allow accounting for the development of multiple words by a single explanation, reaching a more parsimonious historical scenario than anything built of one-off changes. Conditional sound changes make fairly exact predictions as well about what correspondences future etymological research may find. Though this should not be overstated: etymology is not a black box that feeds us experimental data, it’s made of scientists who are able to read work on historical phonology and might use it to hunt for new etymologies, in principle risking confirmation bias. New data rarely outright falsifies conditional sound changes either: more common responses in my impression are to either narrow down the conditioning further yet, or to seek explanations through relative chronology, so that apparent exceptions may turn out to be accountable as being due to counterfeeding sound changes.

As I’ve stated already in the intro slides to my CIFU 12 presentation: “one who seeks, shall find”. For several years now, a large proportion of new discoveries in Uralic historical phonology have precisely been conditional sound changes, either entirely new ones, or new and improved conditions for known sound correspondences. This includes also almost all results I am “sitting on”. Hence it seems evident that this represents a major underresearched area.

This is all the more surprising since ample preliminary work has regardless already been done! With just a bit more rigor, many minor “sporadic” sound changes assumed by mid-20th-century researchers like Itkonen, Collinder or Rédei (to an extent also even 19th-century pioneers) could probably be transformed into more regular shape. This goes beyond the big names too: many minor articles may yet turn out to have the seeds of important insights, as maybe best exemplified by Lehtinen 1967; but also (staying still within Finnic) e.g. Bergsland 1968 as the inspiration for my idea of more general palatal unpacking *AĆ > *AjC, or the various loanword studies that first discussed the idea of a sound change *ej > *ii. I already have enough material to put together a PhD from ideas I’ve uncovered or developed on my own, but going onward from there, compiling and reassessing proposed sound changes from earlier research seems to me like an important desideratum for Uralic studies in the early 21st century.

[1] At least the Selkup development is perfectly explainable: there is no **pwë- in Proto-Selkup, and evidently diphthongization of Proto-Samoyedic *å to *wë was blocked after labials. Terentyev in СФУ 16 suggests *å > *o between a labial and a resonant (thus also e.g. PU *pončə ‘tail, back’ > PSmy *pånčə > PSk *ponč-ar ‘hem’, (? *parka >) *pårkå > *porqɨ ‘coat, parka’), *å > *u between a labial and an obstruent (thus also e.g. *mośkə- > *måsə- > *musɨ- ‘to wash’, *poskə > *påtə > *putɨ-la ‘cheek’). There are also cases of *å > *o/u not preceded by a labial though. I wonder whether syllable closure, and/or whether PSmy *å goes back to PU *a or *o, should also be taken into account.
[2] Most extensively in: Lehtisalo, T. 1933. “Zur geschichte des vokalismus der ersten silbe im uralischen vom qualitative standpunkt aus” [sic: no caps]. Finnisch-Ugrische Forschungen 21: 5–55.
[3] E.g. ‹i› when giving modern Hu. ë/ö is likely to have been a shorter/laxer *ɪ, while ‹i› when giving modern Hu. i/í is likely to have been longer/tenser *i ~ *iː, as can be confirmed by different Uralic sources for the two — and hence these correspondences do not involve “sporadic” lowering of †i, but rather quite regular lowering of */ɪ/.
[4] It is in theory however possible, given long enough phonological development, that many conditional sound changes bleed a proto-phoneme such as *a on its way to some default reflex *A in a sub-branch, and that this is then bled by additional conditional sound changes in several environments, including all the retention ones, on its way to some modern reflex like /a/, so that there aren’t actually any cases left at all where *a > /a/. In such a case all reflexes of original *a in this modern variety would be conditional one way or the other. An almost-example is the fate of PU *k in Tundra Nenets: as a singleton it is palatalized to /sʲ/ before front vowels and (? backed-then-)lenited to /x/ before back vowels; in coda position it is debuccalized to /ʔ/ as the first member of a cluster, and almost always lost as the second member, so that the “default” development *k > /k/ is only really found in the original cluster *kk. On average Uralic is still phonologically compact enough though that nothing like this usually happens.
[5] One other common option is “find a root that does work phonologically, then go hog wild with semantics”. This has given us many such great etymologies as Kari Liukkonen’s infamous derivation of Finnic *noki ‘soot’ from Baltic *nagis ‘nail’, allegedly through an unattested sense ‘dirt under fingernails’ (I wish I were kidding). — When in need of a patch, I seem to tend towards phono-semantic contamination the most for some reason. Arguably this is also an underresearched area, but again, semantic change is singular and cannot be actually usefully reconstructed all by itself. At most it seems that we could collect examples and try to look for typological generalizations, hardly a project to have lasting impact very soon.

