Linkday #1: On computational phylogenetics

I think I’d like to have more content up on this site, despite being tied up with studies and life’s other little distractions from research. Showcasing some interesting articles might work for that, even when I don’t have detailed critique to offer myself on their topics. (There might be some drift away from exclusively Uralic topics, while I’m at it.)

For starters, I’ll bring up a post from the evolutionary linguistics blog Replicated Typo: Reconstructing linguistic phylogenies — a tautology?

This is a relatively old post (2011), and yet captures a number of problems that I keep seeing in computational typological studies.

Two comments, though:

  • The diffusability of sound changes across lineages means that no, establishing a sound correspondence is not quite the same thing as establishing phylogeny. After all, if reconstructing a proto-language automatically also generated a phylogenetic tree of its descendants, there would likely be no need for computational phylogenetic studies in the first place!
  • I’m not sure if I agree that identifying words as cognates presupposes that the languages they occur in are related. There’s a narrower and a wider sense of “cognate” out there: the first is indeed restricted to words related by common descent — but when we’re talking about more hypothetical relationships, the word can also mean “of the same origin thru some means, possibly but not necessarily involving borrowing”. A typical example would be the Uralic and Indo-European words for ‘water’ or ‘name’, for which there is a widespread consensus for some kind of a relationship, but two different camps on how they should be explained.

A morphophonological place avoidance effect in Finnish

I brought up Similar Place Avoidance (SPA) a couple of posts ago. Here is a neat case study of it in action, one that I have already noted quite some time ago.

An Introduction

The Finnic languages are usually considered to have no strict noun/adjective division, and adjectives are analyzed as the same part of speech as nouns. But this does not mean that there are no visible differences between words that have regular nominal semantics (“substantives”, in the Finnish grammatical tradition) and those that have adjectival semantics. While there are a couple of underived, bare-root adjectives (e.g. kova ‘hard’; nuori ‘young’), most Finnic adjectives are marked with an ending that reveals their semantic function.

In Finnish one of the more common adjectival endings in “adjectival roots” is the suffix -ea, -eä (henceforth written together as -eA). This is in contrast to suffixes that derive adjectives from words that, as their bare root, function as nouns. E.g. punainen ‘red’, with the highly common adjectival ending -inen, derives from a separate noun puna ‘redness’, whereas valkea ‘white’ does not allow a synchronic morphological division into a self-standing root with one meaning + a suffix with another. [1]

An Issue

There is however a curious statistical gap in the distribution of -eA: it seems to shun preceding dental consonants. Using the Nykysuomen sanalista wordlist as a reference, there are no Finnish adjectives ending in -neA to be found, and no more than six ending in -teA, either:

  • kiinteä ‘solid’
  • kostea ‘moist’
  • lattea, litteä ‘flat’
  • nuortea ‘youthful’
  • pirteä ‘cheery’
  • reteä ‘chill’

A seventh could be implied in vetreä ‘spry’, which seems to come via metathesis (perhaps by influence of potra ‘thriving, brisk’; usually only in potra poika?) from earlier *verteä. Compare verrytellä ‘to stretch, to flex’, which appears to share the same stem /vert-/. [2]

For comparison, we can count e.g. adjectives ending in /-peA/. There are more than twenty of these:

  • apea ‘sad’
  • hempeä ‘romantic’
  • hilpeä ‘jocular’
  • hulppea ‘extravagant’
  • kalpea ‘pale’
  • kapea ‘thin’
  • kepeä ‘light’
  • kipeä ‘sore’
  • kirpeä ‘sour, crisp’
  • kopea ‘arrogant’
  • leppeä ‘mild (of weather)’
  • nopea ‘fast’
  • nyrpeä ‘grumpy’
  • rapea ‘crisp, crunchy’
  • ripeä ‘prompt’
  • suopea ‘benevolent’
  • suppea ‘concise’
  • turpea ‘swollen’
  • tympeä ‘stale’
  • upea ‘fantastic’
  • ylpeä ‘proud’

/t/ is, overall, a more common consonant than /p/ in Finnish, so getting this kind of a result is not a priori expected.

I have also counted examples with other consonants. An uneven distribution clearly biased against dentals continues: there are e.g. on the order of 60 adjectives ending in -keA, and about 20 in -meA.

The result is however quite understandable in light of the SPA principle. The Finnish adjective ending -eA comes from earlier *-eðA < Proto-Finnic *-edA < Proto-Uralic *-ətA. This would be the easiest to demonstrate using the peripheral Finnic languages Veps and Livonian, which retain PF *d, but even some examples older yet can be found:

  • Fi. dialectal kalkea ‘hard’ ~ Moksha /kalgəda/ ‘id.’ < *këlkəta
  • Fi. tankea ‘stiff’ ~ Moksha /taŋgəda/ ‘id.’ < *tëŋkəta
  • Fi. oikea ‘right’ ~ Moksha /viďä/ ‘id.’ < #wɜjkəta
  • Fi. pimeä ‘dark’ ~ Komi /pemɨd/, Udmurt /peĺmɨt/ ‘id.’ ~ Proto-Samoyedic *pəjmətä ‘id.’ < #pid₂mətä

So we can expect SPA to have intervened, over the course of millennia, to somehow clean out any undesirable sequences of two syllables beginning with dental stops. Of course, the modern Finnish ending has no signs of a dental element anymore, and so we could perhaps hypothesize that the six or seven exceptions have been formed only after the loss of the segment. (Indeed, as far as I can tell, none of them have exact equivalents in any related language, and most are somewhat limited even in their distribution across the Finnish dialects.)

An Analysis

But how has this worked exactly in practice? There is no shortage of Finnic word roots with medial -t-, and a fair number with medial -n- as well. Does this imply that words that once upon a time ended in *-tedA and *-nedA have changed to something else?

I believe the solution has been instead morphological. As an adjectival ending, -eA still has several competitors in Finnish, and every so often we can find sets of essentially synonymous adjectives (with only minor differences in register and tone) that differ only in what suffixes are employed. Examples that can be noted in modern Finnish include -Ut (as in kevyt ‘light’), -(A)kkA (as in kalvakka ‘pale’, rivakka ‘prompt’), and participles such as the past active -nUt (as in turvota ‘to swell’ → turvonnut ‘swollen’).

One suffix that comes particularly close to -eA in shape is the regular present active participle -(e)vA, also commonly repurposed for deriving adjectives. [3] OK, the preceding vowel is taken from the verb stem and is not a part of the suffix: mene- ‘to go’ → menevä ‘going; busy’, but osu- ‘to hit’ → osuva ‘hitting; apt’, or paina- ‘to press, to weigh’ → painava ‘pressing; heavy’. But the interesting part is the existence of a couple of adjectives that seemingly possess this ending, and yet are not derived from any known verb. Frequently they seem to derive from nominal stems instead. What is more, quite a few of these are both apparent e-stems, and have a preceding /t/:

  • etevä ‘skilled’ ← esi : ete- ‘fore-’ (but not ‘to advance’)
  • harteva ‘wide-shouldered’ ~ hartia ‘shoulder’ (no verb ˣharte- ‘to be shouldered’ exists)
  • jäntevä ‘wiry, spry’ ~ jänne ‘sinew’ (no verb ˣjänte- ‘to be sinewy’ exists)
  • kalteva ‘slanted’ ← kalte- ‘side’ (but not ‘to slant’)
  • kätevä ‘handy, dexterous’ ← käsi : käte- ‘hand’ (but not ‘to do with hands’)
  • lehtevä ‘leafy’ ← lehti : lehte- ‘leaf’ (but not ‘to be leafy’)
  • luonteva ‘natural, easygoing’ ~ luonto ‘nature’ (no verb ˣluonte- ‘to be natural at’ exists)
  • ponteva ‘vigorous’ ← ponsi : ponte- ‘motion, exertion’ (but not ‘to exert oneself’)
  • roteva ‘robust’ (seemingly underived)
  • varteva ‘tall (of people)’ ← varsi : varte- ‘stem’ (but not ‘to be tall-bodied’)

I also have counted a couple of cases where this kind of suffixation seems to have taken place before non-dental consonants, but these are clearly rarer. There is only one debatable case with -pevA: lipevä ‘slick, unctuous’. The bare root lipe-, used as a base for a large number of words related to slipperiness, is not itself verbal, although there is the closely similar verb lipeä- ‘to slip’ (its regular present active participle is lipeävä). With -kevA there are six cases (e.g. väkevä ‘strong’ ← väki : väke- ‘people’ < *‘power’). So in the end, probably the extension of -(e)vA from a regular participle function to another adjectival ending has taken place here as well. But we can still see a clear discrepancy between the 10 : 7 ratio of -evA to -eA adjectives when the previous consonant is /t/; a 6 : 60 ratio when it is /k/; and a 1 : 21 ratio when it is /p/.
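
To make the three ratios easier to compare, they can be turned into shares of -evA among all adjectives with a given preceding consonant. This is only a minimal sketch: the counts are the ones quoted above (with vetreä included among the seven -teA cases, and “on the order of 60” taken as exactly 60), and the dictionary layout and function name are my own choices for illustration.

```python
# Counts quoted in the text: adjectives in -evA versus -eA, by preceding consonant.
# The figure 60 for -keA is a round approximation, not an exact count.
counts = {
    "t": {"evA": 10, "eA": 7},
    "k": {"evA": 6, "eA": 60},
    "p": {"evA": 1, "eA": 21},
}


def share_of_eva(c):
    """Fraction of the adjectives after this consonant that take -evA."""
    return c["evA"] / (c["evA"] + c["eA"])


shares = {cons: share_of_eva(c) for cons, c in counts.items()}

for cons, share in shares.items():
    print(f"-{cons}evA share: {share:.1%}")
```

The shares come out at roughly 59% after /t/, 9% after /k/ and 4.5% after /p/, which is the same discrepancy stated as percentages.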


What have we seen, and are able to conclude, so far?

  • The Finnish adjectival ending -eA has been disproportionally rarely applied to stems that have a medial dental stop.
  • By contrast, the ending -(e)vA has been disproportionally often applied to stems that have a medial dental stop; and, arguably, disproportionally rarely to stems that have a medial labial stop.
  • These results support viewing Similar Place Avoidance as a potential statistical linguistic universal.
  • The ending -evA has probably been originally extracted from the participles of e-stem verbs.
  • This extraction may even have happened specifically to acquire an alternative for -eA.

…and skipping a bit further ahead of syllogistic step-by-step argumentation: the most general statement of what is going on here is that derivational morphology is not random. In a morphology-rich language, affix alternants and synonyms will form an “ecology” where potential words are selected for according to their adherence to some kind of aesthetics, such as phonetics-rooted criteria. SPA is one example of such a criterion. There are probably several other relatively general ones that could be identified crosslinguistically. And indeed, there are some other examples that I could illustrate as well.

Some further questions

As for this particular case study: so far I’ve only shown that one particular Finnish adjectival suffix has a non-random limitation on its occurrence; and identified only one other suffix that has been taking on the work of -eA. There would likely be others as well. Most of my -tevA examples were, in the end, derived words, based specifically on e-stem nouns. So what about primary adjectives? Or adjectives derived from A-stem or O-stem nouns? Do they perhaps also have their own specific preferred adjectival endings? I don’t quite have an answer yet.

Also, what about the other Uralic languages? How have they solved this issue? A couple of the adjectival stems on showcase here have cognates elsewhere in Finnic, too (e.g. kätevä being also found in Karelian). But since the adjectival suffix *-ətA dates already to Proto-Uralic, we can expect this particular problem to come up several times before as well. Could we find similar limitations in its distribution in e.g. the Mordvinic or Permic languages?

Obviously this question could also be extended to any other suffix type. Deverbal nouns? Frequentative verbs? Diminutives? There’s a lot that could be studied about derivational morphology across the Uralic languages.

[1] Historically, there may well exist a derivational relationship, though. Note e.g. valo ‘light (n.)’, vaalea ‘light (a.)’.
[2] The ultimate root for these could be veri ‘blood’, the implied derivation being thru an unattested (?) verb ˣver-tä- ‘to be full of blood’ > ‘to be energetic’?
[3] This has even developed historically by a similar intervocalic lenition from *-βA < *-bA < *-pA, so at almost any given stage of Finnish prehistory it would have been the exact [+labial] counterpart of the [+coronal] *-ətA.

Some things rotten in the history of Tungusic

On a whim, I’ve started to investigate the lexicon of Proto-Tungusic, which the Moscow school of Nostraticists maintain a handy database of (as they do for pretty much all Eurasian language families).

I am currently about 10% in, having looked thru (and transferred into a spreadsheet for further analysis) all roots beginning with *a, *ā, *b and maybe half of *č. Interestingly though, there are already a couple of clear signs that the analysis is not exactly reliable, even without me knowing anything about any Tungusic language in detail. In some respects things even appear to be, quite simply, terribly wrong.

In particular, one obvious argument stands out against the Altaic hypothesis, at least in the strongest form as advanced by the authors: around 98% of the words in the database (so far, 235 out of 240) are traced in some form back to Proto-Altaic.

So, Tungusic, despite being a family bordering unrelated languages on several sides (Nivkh, Chukotko-Kamchatkan, Yukaghir, Sinitic), and distinct enough also from its supposed relatives that no generally accepted protolanguage has been so far reconstructed — is regardless supposed to contain less than 3% non-inherited material in its reconstructible vocabulary! All substrate loans, all proto-language era loans, all areally widespread loans, all coinages, all onomatopoeias, all words that have semantically diverged so far that their ancestry has become opaque: these categories are supposed to wholly fit among the five allegedly non-Altaic word roots I have down so far. I guess the people responsible for this project haven’t grasped the idea that there even is a typology of etymology that they are violating.

Now, sure, if “Proto-Altaic” was brought forward as a synchronic grab-bag of word roots and typological features that are just found in some shape across a wide area in central to northeastern Eurasia, this all would not necessarily be a problem. We’d just call it a work in progress, and hope for eventually sorting out which words indicate Mongolic loans in Tungusic, which ones Para-Tungusic loans in Korean, which a mutual substrate in Turkic and Tungusic, etc. Yet as far as I can tell, no self-professed Altaicist takes this stance.

It somehow gets worse from here yet. A preposterous amount of the time, words are reconstructed to Proto-Tungusic on the basis of only a single language, plus external parallels. The typical language in the family seems to retain about 40-60% of the original vocabulary (you may wish to compare this against the previous number). If the vocabulary had later only been subject to random loss, we’d expect that words surfacing only in one language (out of ten, as per the database’s analysis: Evenki, Even, Negidal, Manchu, Ulcha, Orok, Nanai, Oroch, Udighe, Solon) occurred about 10 × 0.5^10 ≈ 1% of the time (0.5^10 ≈ 0.1% for each particular language, times ten possible sole survivors). Guess how many actual cases the current sample includes? 34, i.e. about 14%. An additional 17 roots (~7%) are then limited to a single sub-branch of the family, e.g. Northern Tungusic.
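
The expected rate can be checked with a short binomial calculation. This is a sketch under a strong simplifying assumption (mine, for illustration): every root survives independently in each of the ten languages, with the same retention probability. Under that assumption 0.5^10, about 0.1%, is the chance that a root survives in one specified language and nowhere else; allowing any of the ten languages to be the sole survivor multiplies this by ten, to about 1%. The function name is hypothetical.

```python
from math import comb


def p_exactly_k(n, k, retention):
    """Binomial probability that a root survives in exactly k of n languages,
    assuming independent survival with the same probability in each."""
    return comb(n, k) * retention**k * (1 - retention) ** (n - k)


n_languages = 10

# The text gives 40-60% retention, so try both ends and the midpoint.
for retention in (0.4, 0.5, 0.6):
    p_one = p_exactly_k(n_languages, 1, retention)
    print(f"retention {retention:.0%}: exactly one language {p_one:.2%}")

# Observed shares in the sample described in the text.
sample_size = 240
single_language = 34 / sample_size
single_branch = 17 / sample_size
print(f"observed: single language {single_language:.1%}, single branch {single_branch:.1%}")
```

Even at the low end of 40% retention the expectation stays in the low single digits, well short of the observed share of roughly 14%.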

This kind of a discrepancy might still be excusable, if this were a Turkic-type situation — a family where one of the main branches is currently only represented by a single language. In such a case, any word that had been lost in the “main” branch could well have been still retained in the “minor” branch. But that won’t work here: the isolated vocabulary is scattered over several languages, including especially both far ends of the family (Evenki in the north, Manchu in the south), and occasional cases from most other languages as well.

Whether there are issues in the actual raw lexical data though, I couldn’t tell, but it’s cited from a decent variety of sources… so at least there should be no reason to suspect a systematic heterodox methodological bias.

Of course, knowing that there is a problem is not equivalent to knowing how it should be fixed. The latter will take a bit more work than a single blog post, I am sure. One path would be the traditional etymological approach: to just wade in and start noting comparisons that are phonetically or semantically dubious, and see how much that takes care of. But, there are other options as well that might turn out more effective. E.g. zooming in on material that phonologically stands out (possible loanword phonemes and similar features) would perhaps lead to something. I moreover have in mind, one step more quantitative yet, a relatively simple statistical check-up: correlating the internal Tungusic distribution of the word roots to the external distribution of their Altaic parallels. E.g. if a substantial number of loans to/from Mongolic have been here misinterpreted as inherited, I’d expect a language such as Manchu (neighboring Mongolia) to contain more of these than a language such as Negidal (by the Sea of Okhotsk coast)? We’ll see. I will have to do a separate sweep of the Altaic database later to log this info, and I still have quite a while to go here as well.
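
The statistical check-up could look something like the following. It is a rough sketch only: the record layout, the function name and above all the toy records are invented placeholders, not entries from the database, and the language labels "X" and "Y" stand in for whichever two Tungusic languages get compared.

```python
def share_with_parallel(roots, language, family):
    """Among roots attested in `language`, the fraction that also have
    an external parallel in `family`. Returns None if nothing is attested."""
    attested = [r for r in roots if language in r["languages"]]
    if not attested:
        return None
    hits = sum(1 for r in attested if family in r["parallels"])
    return hits / len(attested)


# Invented placeholder records, NOT real data: each root lists the Tungusic
# languages it is attested in and the branches with claimed parallels.
toy_roots = [
    {"id": "root-1", "languages": {"X"}, "parallels": {"Mongolic"}},
    {"id": "root-2", "languages": {"X", "Y"}, "parallels": {"Mongolic", "Turkic"}},
    {"id": "root-3", "languages": {"Y"}, "parallels": {"Turkic"}},
    {"id": "root-4", "languages": {"X", "Y"}, "parallels": set()},
]

for lang in ("X", "Y"):
    print(lang, share_with_parallel(toy_roots, lang, "Mongolic"))
```

With real data, a markedly higher share for a language in contact with Mongolic than for one far away from it would be the kind of signal I have in mind.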

On the epistemology of sound change, part 1

Continuing from the last post, and toning the meta-ness of the discussion down just a little…
What does it, at the level of everyday research, mean for me to request “justification on the basis of more elementary phenomena” for the concepts of historical linguistics? Say, from the viewpoint of sound change?

The foundations of the concept

The concept of sound change is already implicit in the concept of cognate words. If we assert that a word such as Hungarian ősz ‘autumn’ is cognate to Finnish syksy ‘id.’ (I will not go into unpacking what it means for a “language” to have a “word” that is expressible by a string of letters, although these are good questions to ask as well) — then this means that at one time, a common proto-form of the words existed. The self-contained apparatus of historical linguistics can also produce a graphical representation of this; according to my chosen system of Proto-Uralic reconstruction, it will be *sükśə or perhaps *sükəś. We may also propose a phonetical value for this. Indeed, most linguistic transcription systems already do this implicitly. Despite occasional use of cover symbols for difficult-to-reconstruct segments such as Proto-Uralic *d₂, or in words whose precise phonemic content cannot be resolved from the available evidence, I cannot say I have ever seen a purely abstract presentation of a proto-language.

Usually, the suggested pronunciation of the word will end up different from the real attested words on whose basis it was posited, in at least some respects. This thus requires that changes in pronunciation, i.e. sound changes, have occurred at some point in the evolution of Hungarian, Finnish, and any other Uralic languages. In this particular case, we have the loss of palatalization, *ś > *s, in both Finnish and Hungarian; the loss of *k, plain *s, and the second-syllable vowel entirely in Hungarian; the lowering of *ü to *ö in Hungarian; and its acquisition of length. It would be possible to shuffle some of these changes around (e.g. perhaps it is not Hungarian that has lost a /k/, but Finnish that has gained one? perhaps originally a third type of sound yet occurred here?), but the fact that ősz and syksy are not identical in their pronunciation will remain.

In clearer words yet: that sound change somehow, for some reason happens is clear already from the idea that etymologies exist; that non-identical words can have a common origin. Interestingly, note also that the existence of relationships between entire languages is not a required assumption at this point.

By the way, note that I do not claim this to be the actual history of how the concept of sound change was developed (that story is much more complex yet). This is only an observation on the internal logical structure of the modern-day theory of historical linguistics.

There is thus “downwards inference” involved here. Instead of tinkering with empirical research on articulation & such and discovering that a certain series of events can add up to the large-scale phenomenon of sound change, we have looked at higher-level data yet and found patterns that can be effectively explained by assuming the existence of sound change — despite not yet knowing the first thing about how it works. As a scientific theory, this is adequate insofar as it can still provide predictions, but naturally it leaves us asking: what exactly are these sound changes? Can we actually see one happening somewhere? Could a detailed understanding of them enhance our understanding of etymology as well?

Motivating the instances

There do exist disciplines like phonetics and sociolinguistics that are directly tackling the questions of the evolution of language on the scale of years, weeks, milliseconds rather than generations. However, the theory of sound change can be further sharpened already by more careful investigation of etymology.

There is of course also the tiny snag that detailed sociolinguistic data or phonetic records do not exist for the pre-modern histories of languages (let alone the entirety of prehistory). So we are mostly unable to directly observe sound changes in their full historical context, and indirect inference from etymological data remains our almost sole option for finding out about them. This means that care is required to not drift off to pure speculation.

What, then, is sufficient evidence for assuming a specific sound change to have occurred?

The “naïve method”, I could call it, is to simply indiscriminately collect sound correspondences that “seem to” exist between some given language varieties, claim as cognate any pairs of words that can be linked by some application of these, and then present some sound changes that can account for the correspondences. This has been, and continues to be, used as the usual first step in investigating the sound correspondences within a group of “obviously related” language varieties, such as dialects of a single Language. (We still do not need to take a stance on whether “non-obvious relationships” can exist between languages.) Or, if we’re investigating loanword etymology, we might look at any pair of languages we believe to have been once somehow involved with each other.

Back to the previous Finnish vs. Hungarian example, e.g. two original s-ish consonants can be assumed right off the bat, that we could preliminarily call *s₁: defined as becoming /s/ in Finnish, vs. zero in Hungarian; and *s₂: defined as becoming /s/ in Finnish and Hungarian both.

I will skip for now the wider problem of how to determine what segment exactly corresponds to what. For Uralic languages this is known to be, 99% of the time, a simple task: an initial stressed syllable corresponds to an initial stressed syllable, consonants correspond to consonants, vowels correspond to vowels.

The naïve method is much too powerful though, and left to run on its own, will inevitably lead to an unfalsifiable system where anything can be related to anything else. This is because under it, word-level and sound-level relatedness are transitive. If we claim that sz in Hungarian ősz corresponds to the 2nd s in Finnish syksy, then it follows that the words are related in general, and that also Hungarian ő corresponds to Finnish y. If this is taken as an excuse to now relate any word that has y in Finnish to any word that has ő in Hungarian, and so forth, this eventually allows racking up a correspondence library that relates everything to everything else.

There are, in principle, two ways of avoiding the problem. The first is a purely statistical approach: if two words don’t share at least some proportion X of known sound correspondences, we do not accept the comparison and do not accept any new sound correspondences that it would imply. This algorithm requires a “seed” of correspondences though — if you sic it on languages of which you know nothing, it will detect no related words, what with no correspondences being accepted yet. A “seed” must be instead generated by some other method. Likely ideas for this might be:

  • Having to build up a set of word comparisons that is closed with respect to sound correspondences, and where every correspondence occurs at least n times. An n = 2 example might be Finnish kala, pala, kesä, pesä ~ Northern Sami guolli, buolli, geassi, beassi (‘fish’, ‘bit’, ‘summer’, ‘nest’).
  • A set of correspondences that occur highly often and/or are between identical segments.

These seed methods, I believe, probably won’t manage to uncover everything that can be uncovered all by themselves, but let’s leave a closer analysis of their pros and cons for some other time.
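
The first seed criterion can be sketched in a few lines, using the n = 2 example above. The segment-by-segment alignment of the Finnish and Northern Sami words is an assumption I make here for illustration (vowel sequences and geminates are treated as single segments), and the function name is hypothetical.

```python
from collections import Counter

# Each comparison is a list of aligned (Finnish, Northern Sami) segment pairs.
# The segmentation is an assumption made for this sketch.
comparisons = {
    "fish": [("k", "g"), ("a", "uo"), ("l", "ll"), ("a", "i")],    # kala ~ guolli
    "bit": [("p", "b"), ("a", "uo"), ("l", "ll"), ("a", "i")],     # pala ~ buolli
    "summer": [("k", "g"), ("e", "ea"), ("s", "ss"), ("ä", "i")],  # kesä ~ geassi
    "nest": [("p", "b"), ("e", "ea"), ("s", "ss"), ("ä", "i")],    # pesä ~ beassi
}


def seed_check(comparisons, n):
    """Return (ok, counts): ok is True if every correspondence used
    anywhere in the set occurs at least n times within the set."""
    counts = Counter(pair for pairs in comparisons.values() for pair in pairs)
    return all(c >= n for c in counts.values()), counts


ok, counts = seed_check(comparisons, n=2)
print(ok)
for (fi, sme), c in sorted(counts.items()):
    print(f"{fi} ~ {sme}: {c}")
```

Under this alignment all eight correspondences occur exactly twice, so the set passes for n = 2. Removing any one of the four words breaks the closure, since its correspondences would then be attested only once.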

A more interesting point is that these methods, phrased solely in terms of sound correspondences, are mainly focused on binary comparison. Correspondence-counting of any kind however runs into some rather nasty mathematical problems when a larger number of language varieties is involved. Consider for example: should a correspondence set t ~ t ~ t ~ t ~ θ ~ t be counted as a completely different entity from a correspondence set t ~ t ~ t ~ t ~ t ~ t?

  • If yes, we hit what is called the curse of dimensionality. Say we have two languages with 20 consonants each: there are then 20² = 400 possible correspondences between these, and we can well expect a decent bundle of etymological data to not only highlight which of these correspondences are highly recurring, but also which are noticeably rare or absent. But if we rather have as few as six languages, the count of possible correspondence sets becomes 20⁶ = 64,000,000. The lexical stock of even the best-documented languages only reaches a fraction of this, and a typical etymological data set is unlikely to exceed a couple of thousand words. Given a space of millions of possible correspondences, most data points will perhaps cluster at some stable points, or in their vicinity. Any correspondence set that turns up outside of these islands (say, an apparent correspondence h ~ t ~ t ~ d ~ s ~ z) will be difficult to assess. Also, by far most possible correspondence sets will be entirely absent, and we’ll have no chance of telling if the absence of one particular correspondence carries any statistical significance.
  • And if no — if we treat comparison sets as built from pairwise correspondences — then the transitivity problem pops up again. If we find a correspondence t ~ t ~ d ~ d ~ d ~ d, and we already know the existence of a correspondence t ~ t between the first two languages, can we really count it as evidence for the unity of this larger correspondence set in general?
  • And what of incomplete correspondence sets? Suppose we have p ~ b between languages 1 and 2; b ~ v between languages 2 and 3; p ~ v between languages 1 and 3. Can we really take this as sufficient evidence to unite them to a single correspondence set p ~ b ~ v? What if a correspondence p ~ p between 1 and 3 exists as well?
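
The scale of the first problem is easy to put in numbers. In this sketch the inventory size of 20 is the one used above; the word count of 2000 and the figure of three consonant correspondence sets per word are round assumptions of mine, and the function names are hypothetical.

```python
def possible_sets(inventory_size, n_languages):
    """Number of distinct correspondence sets if every combination
    of one consonant per language counts as its own entity."""
    return inventory_size**n_languages


def max_coverage(inventory_size, n_languages, n_words, sets_per_word):
    """Upper bound on the share of possible sets that a word list could
    attest, if every set it contains happened to be distinct."""
    attested_at_most = n_words * sets_per_word
    return min(1.0, attested_at_most / possible_sets(inventory_size, n_languages))


for n in (2, 3, 6):
    total = possible_sets(20, n)
    cover = max_coverage(20, n, n_words=2000, sets_per_word=3)
    print(f"{n} languages: {total:,} possible sets, at most {cover:.4%} attested")
```

With two languages the data can in principle cover every possible correspondence; with six, even this generous upper bound leaves all but a sliver of the space unattested, so the absence of any one set tells us next to nothing.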

Instead of getting stuck on fine-tuning these problems, it’s however possible to change gears. There is a second, fundamentally different method possible as well: the chronological approach, whose nature I will be elaborating in the next post of this series.

Indo-Iranisms galore?

Currently I am making my way through a fascinating and peculiar book: Hartmut Katz’s posthumously released Studien zu den älteren indoiranischen Lehnwörtern in den uralischen Sprachen (Heidelberg: Universitätsverlag C. Winter, 2003).

Fascinating, in that the book’s ~700 loan etymologies, some of them providing novel and quite believable solutions to various etymological problems, are obviously much food for thought.

Yet also peculiar, in that Katz’s goal seems to have been to ascribe Indo-Iranian origin to as many words as remotely possible; while operating with a bizarrely dated framework of Uralic history, cut off from modern research. His reconstructed Proto-Uralic appears to be based mainly on the work of Wolfgang Steinitz around the 1940s, and views newer than the 60s are just about not even cited.

It would not be too hard to spend a while discussing the shortcomings of his reconstruction scheme in detail, but I suspect Katz’s ghost will not be appearing to defend, amend or recant his views. So for now, to simply note some strange ideas included, he posits e.g.

  • Mobile stress for old stages of Uralic, on the leftmost non-reduced vowel in the word.
    — To be fair, this is a system that can be indeed found in Uralic languages of the Volga-Kama area, and some hints of it can be seen in Hungarian and Mansi as well. But the complete absence of evidence for it in Samic and Finnic on one hand, and most of Samoyedic on the other, definitely suggests a secondary areal innovation. Katz fails to match this stress system with Indo-Iranian either.
  • A relatively extended system of “ablaut” that seems to be essentially projected backwards from Khanty (whose vowel alternations are usually thought to be umlaut anyway), and then employed to explain some exceptions that fail to fit his vocalism framework.
  • A reconstruction of Proto-Permic that contrasts eight labial mid vowels: short and long *ɔ, *o, *ɔ̇ [ɞ], *ȯ [ɵ] — yet has no plain *a of any length.
  • Four different sibilant series, with the new fourth one generated by splitting standard PU *ś in three. Both splits are established on the basis of evidence from a single language (Mator for *ć versus *ś; Mansi for *ś versus *š´). He still seems content to consider the traditionally similarly problematic retroflexion contrast in Khanty (*n *l versus *ɳ *ɭ) as entirely “affective”, however.

Some of the alleged sound substitutions in loanwords are puzzling as well. One claim that I am definitely not buying offhand is that PIE labiovelars could have been substituted by plain labials — sometimes, even, in different ways in the same word. For example, Katz attempts to derive both Finnic *pöörä ‘wheel’ and *käkrä ‘curved’ from pre-II *kʷekʷra- ‘wheel’. The latter strikes me as at least possible (there is moreover *kekri ‘year, yearly feast’ which has been explained from the same source as well); the former as requiring rather too many phonological assumptions, despite the seemingly straightforward semantics. Especially since the usual PU reconstruction *peŋər- ‘to turn, to rotate’, establishable by comparison with the Ob-Ugric cognates, seems unproblematic.

At other times incredibly archaic PIE sound values are claimed preserved in Uralic. E.g. *e reflected as a front vowel even when adjacent to *h₂; *ḱ reflected as *k. It is unclear to me on what grounds these could be analyzed as specifically Indo-Iranian loans. If anything, such etymologies (in case they’re not illusory) might rather speak for loanword transmission through Late Proto-Indo-European, before diversification; from the adjacent IE branches, say Baltic or Tocharian; or some lost Indo-European languages, perhaps intermediate between Balto-Slavic and Indo-Iranian.

—Or even languages from different families entirely. I suspect taking e.g. Turkic better into account might help for resolving some oddities. There is, for example, some evidence for a correspondence II *bʰ : Uralic *m, which Katz claims would have been a phonetic accommodation to retain the voicing of the original consonant (rather a priori suspicious, since there was no contrastive voicing in early Uralic). At least one of the cases might allow for a different explanation: this is the Ob-Ugric word he reconstructs as *māŋkɜ, meaning ‘hammer’. He compares here Indo-Aryan *bʰangá- ‘to break’. But there is also Turkic *böŋk- ‘to kick, to buck’, with reflexes such as Uyghur /möŋkü/-, Tuva /mög-/, with the assimilation *b-N > *m-N. This seems like at least a plausible intermediate for reaching the Ob-Ugric words (though accounting for the development of the semantics and vocalism would take some extra work).

Another problem should be a warning sign even for modern-day researchers. Katz’s scheme of separating the Uralic etymological material into about four stages of development — Uralic; Finno-Ugric; Finno-Permic vs. Ugric; Permic vs. Ob-Ugric — appears to not produce any particular benefits. Under his framework of historical phonology, there are essentially no differences to be found between the three early stages, only a number of more or less trivial rewriting rules such as “Finno-Ugric” *ə̑ > “Finno-Permic” *ă. The vast majority of the proposed loanwords fail to show support for the dating in their distribution either. “Proto-Uralic” words frequently turn up in Samoyedic and nowhere else (a few even, within Samoyedic, in Selkup only!); “Finno-Ugric” ones similarly in only a single sub-branch such as Mari, Finnic or Hungarian; etc. Chronological paradoxes arise too, when e.g. some words are posited to have been loaned before the Indo-Iranian shift *l > *r, yet during a later era such as “Ugric”, while others are posited to have been loaned after the change, yet during an earlier era such as “Finno-Ugric”. In other words, the traditional taxonomy of Uralic is here treated as a fact that has been given ex cathedra, and not critically engaged at all.

There is still one particular insight that I am happy to see appearing. Quite a few words are present in the data seemingly as etymological doublets (often even triplets, sometimes as many as sextuplets!) across the different Uralic languages. When this happens, Katz does not insist on shoving them under a single Uralic proto-form and deriving all the forms as “variants” or by “sporadic sound changes” (though these mechanisms are still a minor part of his toolbox). Instead he is on board with concluding that the one and the same word can have been loaned several times — as an areal rather than a genetic innovation — and perhaps in different shapes in different proto-dialects. If taken to its full conclusion (comparative reconstruction cannot be based on loanwords), I believe this method is even likely to resolve various lingering problems of reconstruction. Not that that’s quite happening yet in the book. I might cover some exemplary cases in detail in future posts, though.

At any rate, for now this remains one messy bunch of comparisons. Yet clearly on a valid topic, deserving of critical treatment. Perhaps one day we will see a more fruitful analysis of this corpus.

The rooting of historical linguistics

Most of the harder problems in the methodology of historical linguistics seem to come from it being a fairly “high-order” discipline, and a relatively isolated one at that.

To an extent, this is true of all humanities. With the levels of computational power currently available to us, it’s not possible to start with a couple of known physical laws and derive exact predictions about human behavior from them. The best we can do along these lines is to establish boundary conditions. And of course, most of these are sufficiently obvious from our daily experience as humans that they sound more banal than profound when spelled out: e.g. language families usually have a distribution limited to the surface of the planet, and they fail to extend up to the stratosphere, or down to the oceanic crust. :ɪ

But the historical angle complicates things. Most of the historical sciences rely heavily on evidence preserved from the past: history itself is based on written sources, and the “auxiliary historical sciences” such as archaeology on other objects preserved from the past.

And yes, historical linguistics also builds on preserved evidence from the past, mainly via philology and epigraphy. But this has only been a small initial inspiration. Most of our historical insights are instead derived from the observation and analysis of attested modern languages, and the application of a general theory of linguistic evolution. This model is, I think, quite alien to all other humanities. (Even plenty of non-historical lines of humanities research seem to remain stuck in a pre-scientific “there are no theories, only paradigms of discourse” mire.) In this sense historical linguistics has much more in common with evolutionary biology, although I suspect that also that discipline would not be doing as well as it is without the more direct evidence from an extensive fossil record. [1]

The inevitable implication is that nothing in historical linguistics can be understood without a good grasp of the underlying theory. And yet, it seems to me that many of its premises have not often been even stated aloud. No dout this is due to how the theoretical foundation seems to have been developed on a need-to-know basis by its users, as the discipline has expanded, not by any separate class of theoreticians. Yes, starting from the Neogrammarians, many of the surface phenomena have been described, from old’uns like “regularity of sound laws” to innumerable newer achievements like “typology of semantic change in body part terminology”… but the nuts and bolts of it, that really “root” historical linguistics to its sociolinguistic foundations, not so much. There has been so much work in cataloguing the “whats” that we have had not much success yet in uncovering the “whys”.

I am not sure if my term “root” is readily understandable, or if there might be a better term available. It seems likely that this could be confused with a discipline’s internal history, at least. Which is not what I mean: I refer here to how various sciences can be ordered by how far removed they are from the basic laws of the universe. The typical example being how all biological processes can be broken down to individual biochemical processes; all biochemical processes to chemical ones; all chemical processes to particle physical processes. The reason that biology looks very different from chemistry, or from particle physics, is that studying the behavior of macroscopic masses of particles requires very different methods from studying 10 or even 1000 of them. A phenomenon such as embryonic development could in principle be modelled in terms of individual protons and electrons, but this would require enormous amounts of effort wasted on reiterating problems like “how does a water molecule hold together” or “what happens to a protein when it encounters a water molecule”, that have already been solved to sufficient precision for us to instead model an embryo as being built from cells that are built from cellular organelles that are built from macromolecules. A biologist — or a geologist, or a cosmologist — is not interested in the whereabouts of individual particles, but rather in their patterns of distribution at a specific scale in space and time.

The same exact principle holds for the humanities. Say, all psychology is at a certain fundamental level about neurons; but in analysing the overall behavior of the brain, built from a hundred billion neurons, the beliefs, feelings, etc. that they encode can (and must) be treated as entities in their own right. And similarly, while the speech of one human can be studied by phonetics, neurolinguistics, and similar disciplines, it again takes different tools to study the speech of a hundred million humans sprinkled across five thousand years. We need concepts such as “isoglosses” and “etymologies” that exist only as generalizations about the idiolects of individual speakers.

Our tools, however, do not seem to decompose easily into insights about smaller and smaller groups. How exactly does sociolinguistic variation in speech end up producing clean and neat sound laws, or patterns of loanword dispersal, or language areas sharing grammatical features? I do not think we have much more than loose guesses about the workings of these processes, so far.

This type of disconnect is, of course, quite common at the biology/humanities interface, and can be sometimes found elsewhere as well (e.g. in the absence of a working theory of quantum gravity). But to see it within a single discipline — linguistics — seems to me like a situation that ought to be resolvable.

This also means that historical linguistics knowledge rests, to an extent, on questionable ground. If we do not name our implicit starting assumptions, and end up making little effort to justify them on the basis of the more elementary phenomena they emerge from, is there not a risk that our edifice of knowledge stands askew, and ends up being an exercise in the construction of an essentially abstract theory, rather than a real description of the past?

Some philosophers would at this point certainly retort that all historical inquiry, being both unverifiable and unfalsifiable in the absence of a time machine, does not exist for the purpose of creating a real description of the past, but to create compelling stories about it. OK, I say, but some of us happen to consider truth an essential component of what makes a story “compelling”. Moreover… any model of the past will also make predictions about some parts of the present that we have not examined yet, which grants all historical theories a limited degree of falsifiability.

I do not claim to have a dossier of answers to issues of this sort prepared. Perhaps one or two sketches of solutions. But, of course, questions have to be asked before they can be even begun to be answered.

[1] Arguably though one could claim that the majority of our planet’s biodiversity exists at the microscopic level, and that most of biological history must be thus similarly approached via comparative reconstruction. But in my understanding this is a relatively new approach in evolutionary biology; while historical linguistics dove headfirst into reconstruction already back in the 19th century.

Some sunny words

A recent blog post from Christopher Culver brings to my attention an apparent family of Turkic word roots showing irregular variation in form: *künäš ~ *qujaš ‘sun, day, heat’. Aside from the alternation *n ~ *j (for which *ń seems to be a standard explanation), these seem to make up a neat pair of front/back variants.

I am wondering however if this relationship might be illusory, and if there might be an old Uralic loanword in Turkic involved here instead. There are a few Uralic word roots (themselves probably in some sort of an obscured correlative relationship) that seem quite relevant here:

  • *kaja ‘sun, to shine’ (> Finnic *kajasta- ‘to dawn, to shine’, Lule Sami guojijdit ‘to rise (of sun or moon)’, Samoyedic *kåjå ‘sun’, etc.)
  • *kojə ‘dawn’ (> Finnic *koi ‘dawn’, Hungarian hajnal ‘dawn’, Mansi *kuj ‘dawn’, etc.)

Of particular interest is the Hungarian word, which seems to show the exact same “suffixal” elements as Turkic. This even has a formal equivalent in Khanty: *kuuńəɬ´ ‘dawn’ (apparently showing a change *jn > *ń, in neat parallel to the change *jt > *ć that was proposed by Aikio recently [1]), coming closer yet to the Proto-Turkic form.

It’s hard to say though what the dangling element -nal is here. It’s neither an independent word root on its own, nor a regular derivational affix. If I had to speculate, a compound *kojə-n‿alŋV > *kojnal- ‘beginning of sun’ could be assembled… but this seems a bit contrived semantically. Also I am not convinced if Khanty *aaLəŋ ~ Mansi *aaɣəl ‘beginning, end, point’ is an inherited root at all. [2]

And while phonetically the Khanty form in particular seems like a prime loan original, the semantics are a bit off. Is the meaning ‘dawn’ in Hungarian and Khanty perhaps secondary, from earlier ‘sun’ or the like? Or was there instead a shift ‘dawn’ > ‘sun’ in some transmission language along the way?

Some Turkologists, I’m sure, could also see it as an obstacle that this etymology seemingly requires adhering to sigmatism (reconstructing a Proto-Turkic “lateral” *l₂ that later shifts to *š in Common Turkic) over lambdaism (reconstructing PT *š that shifts later to *l in Oghur Turkic). Now, yes, from what evidence I’ve seen, I lean on the view that sigmatism is the better solution [3]… But it is, however, not an entirely inescapable assumption here. Say we instead assumed that early Oghur maintained *l₂ for some time apart from *l (perhaps indeed as a lateral fricative [ɬ]? [4]) Then we could posit an etymological sound substitution to have occurred during propagation to the other Turkic languages: Khanty *kuuńəɬ´ → Oghur #qujal₂ → Common Turkic *qujaš.

Independent loaning to different Turkic varieties might also be chronologically preferable to assuming loaning already to unitary Proto-Turkic. Christopher notes that *qujaš seems to have a kind of northerly-leaning distribution across the Turkic languages… not bad news for an attempted Uralic loan etymology, I’m sure.

[1] Aikio, Ante (2014): Studies in Uralic etymology II: Finnic etymologies. Linguistica Uralica 50:1.
[2] There is, yes, a rather similar word root in Finnic: *alka- ‘to begin’ — but this does not quite correspond regularly to the Ob-Ugric words, esp. on account of the discrepancy between *ŋ and *k. The vowel correspondence Kh *aa ~ Ms *aa is not typical of inherited Uralic vocabulary either.
[3] But note that this does not compel me to take a stance on the similar rhotacism/zetacism debate, nor to consider *l₂ of “Altaic” inheritance.
[4] Which even brings to mind the East Uralic shift *š > *ɬ, rather similar to the shift *š > *l posited by the lambdaist side of the Turkic debate.

Similar Place Avoidance in language history

An interesting paper I’ve found a couple days ago: Pozdniakov, Konstantin & Segerer, Guillaume (2007). Similar Place Avoidance: A Statistical Universal. In: Linguistic Typology 11:2.

The main thesis is relatively simple: most languages of the world disfavor word roots where the word-initial and word-medial consonants have the same place of articulation; and, more generally, word roots combining two peripheral (labial, velar) or two “central” (dental, alveolar, postalveolar, retroflex, palatal) [1] consonants.

I had also independently discovered this principle some time ago in my exploration of statistical properties of phonotactics in the Uralic languages. Unlike P&S, though, my first reaction was not to assume status as a defining characteristic of Uralic in general. Certainly its occurrence in well-separated branches of the family seems to require its occurrence in Proto-Uralic as well… but who knows how much further back it goes? I do not recall seeing very many word roots shaped anything like √kag- or √bomp- in almost any Eurasian language at all, really. I have had an impression they’d be slightly more common in some Niger-Congo languages — but apparently not. (Seeing what the results are for Japanese might be also interesting; the language seems to be quite rife with words like tatami, tsunami, kami, fuku, fugu. But I am not sure if Internet Japanese™ constitutes a representative sample.)

Some further observations on the topic:

The maintenance of SPA

A question that I did not see covered in the paper is that the maintenance of SPA in languages requires a degree of diachronic stability of consonant POA classes. Now indeed, as a first approximation, while fluctuations between different types of e.g. coronals (ts > tθ > θ > ð > d > l > …) or velars (k > x > ɣ > g >…) are commonplace sound changes, it’s much rarer to see consonant evolutions such as *p >> *d or *d >> *x.

But the boundaries are still not impermeable. Quite a few relatively general sound changes are known across the world [2] that convert consonants from peripherals to centrals, or vice versa:

  • Labial > palatal: e.g. *w > *j in Hebrew
  • Coronal > labial: e.g. *θ, *ð > /f/, /v/ in Latin and the other Italic languages; similarly *t > *θ > /f/ in Rotuman
  • Coronal > velar (or uvular): e.g. *š > *x in Finnic, Spanish, Pashto…; *t > *k in Oceanic languages such as Samoan, Hawai’ian; *r > ʀ/ʁ in continental Western Europe; *ɫ > ʁ in Armenian
  • Velar > palatal: *k, *g > *c, *ɟ > tɕ, dʑ — a frequent change: Satemic IE, Romance, etc.

This raises the question of how the strength of SPA evolves in languages. Changes of the above sort, applied to a language that follows SPA, will necessarily decrease its SPA-compliance. If *š frequently co-occurs with velars, and rarely co-occurs with coronals, then a change *š > *x will introduce a larger number of velar-velar roots than velar-coronal roots. It follows that there must also exist some mechanisms that increase the SPA-compliance of a language.
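The arithmetic of that argument can be made concrete with a toy calculation. This is only a minimal sketch: the place classes follow the peripheral/central split described above, but the six-root lexicon is invented for illustration (it is not real data from any language), and each root is reduced to just its two consonants.

```python
# Toy illustration of how a change *š > *x lowers SPA-compliance.
# The lexicon below is invented; only the place classes come from the text.
PERIPHERAL = set("pbmwkgxŋ")   # labials and velars
CENTRAL = set("tdnszšlrj")     # dentals through palatals

def place_class(c):
    """Return 'P' for peripheral consonants, 'C' for central ones."""
    if c in PERIPHERAL:
        return "P"
    if c in CENTRAL:
        return "C"
    raise ValueError(f"unclassified consonant: {c}")

def violates_spa(root):
    """A root (C1, C2) violates general SPA if both consonants share a class."""
    c1, c2 = root
    return place_class(c1) == place_class(c2)

def apply_change(root, old, new):
    """Apply an unconditional sound change old > new to every consonant."""
    return tuple(new if c == old else c for c in root)

# Hypothetical SPA-compliant lexicon in which *š mostly pairs with peripherals.
lexicon = [("k", "š"), ("p", "š"), ("š", "k"), ("m", "š"), ("t", "k"), ("k", "n")]

before = sum(violates_spa(r) for r in lexicon)
after = sum(violates_spa(apply_change(r, "š", "x")) for r in lexicon)
print(before, after)
```

In this made-up lexicon every root with *š turns into a peripheral-peripheral root after the change, which is the worst case; a real lexicon would show a weaker version of the same effect.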

A naive assumption that P&S summarily dispatch would be sound changes running in the opposite direction: place dissimilation to re-establish SPA, a la (? *kaša >) *kaxa > *taxa. Yet this is not a commonly attested type of change at all (the only example I can think of is *t > *k only when a 2nd *t follows; attested IIRC from one of the Oceanic languages [3]), and it clearly cannot be a relevant factor.

My hypothesis is that lexical loss is not random. Suppose a language had two synonyms /maba/ and /suba/ for expressing a given concept; then over time, as the language splits into descendants, SPA-violating /maba/ would be more likely to be lost than the SPA-compliant /suba/. A motivation for this could be that SPA-violating roots are generally found to be more “childish” or “non-serious” in sound, and that they’d be more likely to go “out of fashion”. (Pop quiz: which of the sets {boob, dude, google}; {duty, goop, boogie}; {duke, good, butte} do you find the funniest-sounding?)

This is, in principle, a testable proposition. Take for example the interdental > labial shift in Latin. I would predict that PIE roots that display the change *dʰ >> /f/ ~ /b/ are more likely to be lost or marginalized in Latin (both in early Latin and later on in Vulgar Latin) when there is an original labial or velar consonant in the root as well. Or, in the other direction: I would predict to be able to trace the ancestry of words like duke, on average, a longer way back than that of words like dude.

Affricate co-occurrence

P&S further divide the SPA principle into a couple statements of different strength. The “general” version is that peripherals avoid any other peripherals, and centrals any other centrals; while the “strict” version is the rarity of, especially, word roots with two consonants of the same exact POA. They discover, however, one major divergence from even the last: the Bantu languages apparently feature a high number of word roots with two palatal consonants. I’d guess this represents an assimilation development of some sort. Perhaps the palatal series represents the merger of former palatalized alveolar and palatalized velar series? This relatively frequent development would easily leverage the apparently universal abundance of TK and KT roots to produce instead an abundance of CC roots.

— In Uralic we find no evidence for an especially strong co-occurrence of palatals. However, the postalveolar affricate *č has a strong tendency to “repeat”. There is a remarkable number of old Uralic roots (some of these more, some less secure) such as:

  • *čača- ‘to be born’
  • *čača- ‘to walk’
  • #čEnčä ‘back’ ~ ‘tail’?
  • *čänčä ‘goose, duck’ (from Baltic *džans- < PIE *ǵʰans-)
  • *čëčə ‘duck’ (perhaps also from the above PIE root somehow)
  • #čečə(kä) ‘moment’
  • #či(n)čä ‘little bird’
  • *čoča- ‘to sweep’
  • *čo(n)čə ‘netstring’
  • *čučkə ‘block of wood’

Perhaps a partial explanation would be some sort of consonant assimilation phenomena. At least the 4th word seems to have involved an assimilation *č-s > *č-č. And a couple of these roots are reflected in Finnic and Samic as if coming from original *ć-č — yet not all, as shown by Finnic *häntä ‘tail’, *hetki ‘moment’, Samic *cōccë ‘netstring’ (provided the Uralic etymologies for these are valid: they all involve some irregularities). And maybe the “dissimilating” roots should hence be similarly reconstructed as dissimilar to begin with.

We could also wonder if this should be taken as evidence for an origin of some cases of *č via palatalization from earlier velars.

…and other reduplications

P&S also find, though, that at least some languages can have a tendency to favor “reduplicated” roots (their example is Wolof), with the exact same consonant in the root-initial and root-medial positions. Obviously in a language with several consonants per POA, this effect will be overshadowed by the numerous other combinations possible — so /b-b/ could end up relatively frequent, but cases like /b-m/, /b-v/, /b-f/, /b-ɓ/ etc. will still remain rare.

From my initial observations, though, this does not hold up in Uralic, where classes like “labials” are frequently limited to only a single obstruent *p, the nasal *m and the glide *w or *v. The Proto-Sami lexicon, [4] for example, contains fewer than two dozen PP roots, and most of them are either of the shape *m-v, *p-v; *v-m, *v-p; or *v-v. There is only one root of the shape *p-m; none of *p-p, *m-m or *m-p.

The occurring cases incidentally can be shown to be in large part secondary innovations. E.g. the 2nd class contains *vāpsē ‘blade of mitten’; *vipsë ‘skein’; *vēpsēs ‘wasp’; *vōmë ‘width’; *vōmē ‘woods’; *vōmā- ‘to notice’; *vōmtë ‘body cavity’; *vōmtē- ‘to sell’; *vōpējē ‘narrow bay’; *vōpērēs ‘three-year-old reindeer bull’; *vōppë ‘father-in-law’; *vōppō- ‘to pluck’; *vōpsë ‘mesh in a fishtrap’; *vōptë ‘hair’. Most of the roots here seem to have involved the PS development *a-, *o- >> *vō-. All the rest involve the cluster *-ps-, though I’m not sure what to make of that fact.

Cluster complications

Another question the paper does not address is how should one analyze heterorganic consonant clusters. Most languages of the world prefer a simple CVCV syllable shape over CVCCV. The latter type is regardless fairly popular in some languages. E.g. my index of the Proto-Sami lexicon contains about 920 roots with clusters, about 600 without. So are clusters to be counted as “medial consonant preceded by a coda”, or as “medial consonant followed by another medial consonant”? Is a word such as PS *tolkē ‘feather’ more or less SPA-compliant than PS *kōlkë ‘hair’? The second does have a neat alternating POA structure; but both the syllable onsets are velars. Which of these is more relevant?

From a preliminary look, it stands out that the relative frequencies of 1st members of clusters resemble quite closely the relative frequencies of single medial consonants — while the relative frequencies of 2nd members of clusters closer resemble the relative frequencies of onset consonants. This would seem to suggest that we should indeed be comparing the first two consonants. But the details could fare differently.

Let’s take a sneak peek at velar/velar combinations for example:

  • *kVkV: severely underrepresented (predicted 18, attested 4)
  • *kVŋV: severely underrepresented (predicted 7, attested 1)
  • *kVkCV: underrepresented (predicted 19, attested 12)
  • *kVŋCV: severely underrepresented (predicted 6, attested 2)
  • *kVCkV: underrepresented (predicted 43, attested 31)
  • *kVCŋV: overrepresented (predicted 3, attested 5)

It seems to be here indeed the case that at least word roots like *kōsŋë- ‘to touch’ are patterning as POA-alternating (= not in violation of SPA). But the underrepresentation of *kVCkV does not fit this hypothesis. Though… the data could also be confounded by one of the most frequent -Ck- clusters being the homorganic *ŋk. I’d need to crunch more numbers here to say for sure.
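To put the six shapes on a common scale, one option is to look at the attested/predicted ratio for each. The sketch below uses only the counts quoted in the list above; how the predicted values were originally derived is not restated here, so the code takes them as given rather than recomputing them.

```python
# Predicted and attested counts exactly as quoted in the list above.
counts = {
    "kVkV": (18, 4),
    "kVŋV": (7, 1),
    "kVkCV": (19, 12),
    "kVŋCV": (6, 2),
    "kVCkV": (43, 31),
    "kVCŋV": (3, 5),
}

def oe_ratio(predicted, attested):
    """Attested/predicted ratio: below 1 means underrepresented."""
    return attested / predicted

ratios = {shape: round(oe_ratio(p, a), 2) for shape, (p, a) in counts.items()}

# List the shapes from most to least underrepresented.
for shape, r in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"*{shape}: {r}")
```

With predicted counts as low as 3 or 6, a difference of one or two roots moves the ratio a lot, so the smaller rows should be read with caution.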

There’s clearly much to be made of this topic; I am only scratching the surface so far.

[1] They actually use the term “medial”, but I will not, as this seems likely to be confused with “word-medial”.
[2] That is, discounting cases of local assimilation such as np > mp, mt > nt.
[3] I recall Robert Blust covering this topic in his paper __. I seem to have misplaced my copy of it, though.
[4] Again, as per Juhani Lehtiranta’s Yhteissaamelainen sanasto (1989/2001).

Consonant clusters in Khanty

My previous example of phonotactic combination analysis was on data that was, despite a few kinks, still largely homogenous. But to showcase how it’s important to have a decent basic hypothesis before going into more fine-grained analysis, here’s a look at a rather different dataset. These are the medial consonants and consonant clusters from the inherited Proto-Khanty lexicon, again per Honti’s data (words with cognates elsewhere in Uralic but absent from Mansi are not included).

Some notes about notation etc. though, before I go on.

  • 1st medial consonants (“C₂”) are listed down. Possible 2nd consonants of a consonant cluster (“C₃”) are listed across.
  • I have analyzed PKh *ə as an epenthetic, non-phonemic segment that is inserted in “difficult” consonant clusters in, roughly speaking, stem-final position. E.g. *peLəm ‘lip’ = underlyingly /peLm/. Without this analysis I would be almost comically short on data.
  • *g and *x mark two segments that only contrast in Western Khanty in back-vocalic roots (as /w/ versus /χ/). Honti conflates both as *ɣ. The contrast is not (directly?) recoverable in front-vocalic roots, nor in words that have been retained only in Eastern Khanty, and seems to have been absent from the C₃ position. I have counted ambiguous cases under *g¹.
  • *L and *Ľ are cover symbols for laterals. PKh had a contrast between a fricative *ɬ and an approximant *l, and might have had even a similar contrast among the palatal laterals, but this is not recoverable in the medial position. (By contrast, the retroflex lateral *ɭ was quite certainly an approximant.)

But without further delay, here is what things look like in this part of the word root — sorted by frequency, again:

Proto-Khanty consonant clusters by consonant frequency

Already one look at this table should tell us though that it would be pointless to compare it against what an assumption of random distribution would predict. Not only are there way too many gaps, there are also several strong correlations apparent. Take for example C₂ *ń and C₃ *ć, which are both found almost solely in the cluster *ńć.
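For reference, the usual “random distribution” baseline is the independence estimate: the expected count of a cluster is its C₂ total times its C₃ total, divided by the grand total. A minimal sketch of that calculation follows. The three counts in it are invented placeholders, not Honti’s data, and `expected_counts` is just an illustrative helper name.

```python
from collections import Counter

def expected_counts(observed):
    """observed maps (c2, c3) -> count; returns expectations under independence."""
    total = sum(observed.values())
    row = Counter()   # totals per first member (C2)
    col = Counter()   # totals per second member (C3)
    for (c2, c3), n in observed.items():
        row[c2] += n
        col[c3] += n
    return {(c2, c3): row[c2] * col[c3] / total for c2 in row for c3 in col}

# Invented counts for illustration only.
observed = {("ń", "ć"): 8, ("ŋ", "k"): 10, ("r", "k"): 2}
expected = expected_counts(observed)

print(expected[("ń", "ć")])   # expectation for a cluster attested 8 times
print(expected[("ń", "k")])   # expectation for a cluster that is not attested
```

When two segments occur almost only with each other, the baseline both underpredicts their shared cluster and predicts clusters that never occur, which is why a table full of gaps and tight pairings cannot be read against it directly.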

So the first step ought to be determining some basic background rules of phonotactics. Here is the same data, now sorted by place of articulation instead:

Proto-Khanty consonant clusters by place of articulation

Several qualitative patterns are clear by now.

  • Almost all of the action goes on in the “edge” cells — those combining peripheral (bilabial/velar) and coronal (dental/alveolar/retroflex/palatal) consonants.
  • Nasal + stop/affricate clusters (highlighted in pink) are easily the most frequent type of homorganic clusters. For bilabials, palatals and velars they are the only attested cases.
  • There is a degree of coronal harmony: dentals/alveolars, retroflexes, and palatals do not combine with one another. [1] For the sibilants, nasals and laterals, this is exceptionless. The rhotic *r and the semivowel *j tolerate some exceptions, perhaps due to how the two lack counterparts at other POAs. One case with *-ćt- is attested, namely *kaćtə- ‘to hit’ — and in Northern Khanty only, actually. This is also one of the clusters that’s demonstrably secondary, as comparison to Mansi *këëćk- indicates that the word is to be segmented as *kać-tə-. Perhaps we can assume that in Proto-Khanty, this cluster still remained impossible.
  • Geminates are uniformly forbidden.
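These qualitative rules can be written down as a small classifier, which is handy for filtering the data before any frequency analysis. This is a rough sketch only: the place assignments are my reading of the description above and should be treated as assumptions, the segment inventory is deliberately incomplete (*r, *j, *g, *x are left out, since the first two tolerate exceptions), and `classify` is a hypothetical helper name.

```python
# Assumed place classes for a subset of Proto-Khanty consonants.
PLACE = {
    "p": "bilabial", "m": "bilabial",
    "t": "dental", "n": "dental", "s": "dental", "L": "dental",
    "č": "retroflex", "ɳ": "retroflex", "ɭ": "retroflex",
    "ć": "palatal", "ń": "palatal", "Ľ": "palatal",
    "k": "velar", "ŋ": "velar",
}
NASALS = {"m", "n", "ɳ", "ń", "ŋ"}
STOPS = {"p", "t", "č", "ć", "k"}   # stops and affricates
CORONAL = {"dental", "retroflex", "palatal"}

def classify(c2, c3):
    """Label a C2 + C3 cluster according to the patterns listed above."""
    if c2 == c3:
        return "geminate"
    p2, p3 = PLACE[c2], PLACE[c3]
    if p2 == p3:
        if c2 in NASALS and c3 in STOPS:
            return "homorganic nasal + stop"
        return "other homorganic"
    if p2 in CORONAL and p3 in CORONAL:
        return "coronal harmony violation"
    return "heterorganic"

for cluster in [("ń", "ć"), ("ŋ", "k"), ("ć", "t"), ("t", "t"), ("L", "k")]:
    print("".join(cluster), "->", classify(*cluster))
```

Under these assumptions the *-ćt- of *kaćtə- comes out as a coronal harmony violation, in line with treating it as secondary.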

More detailed frequency analysis should probably focus just on the areas that show no obvious restrictions of this kind. And now we can easily pick out a subset of data suited for this:

Coronal + peripheral clusters in Proto-Khanty

Peripheral + coronal clusters in Proto-Khanty

The data’s still a bit scarce, but here the distribution’s at least more randomized. And hence signs of various “minor” historical developments are now able to better stand out. Plus: note that despite my presentation, this is not really two separate datasets — it’s a single, three-dimensional dataset, with cluster order as the 3rd dimension. We can for example note the disproportionally high count of *-x(ə)L- compared to a disproportionally low count of *-L(ə)g-, almost certainly an indication of the regular metathesis of PU *lk and *sk in Khanty.

A full analysis would again be much more work than I am going to just blog out in my free time, though. I have no dout that this general type of methodology, applied to any one given language, could produce a small monograph’s worth of results…

[1] A result very similar to this has been noted already by Eugene Helimski in 2002: an incompatibility of the dentals *n *t vs. retroflexes *ɳ *č in word-initial vs. word-medial position. See: “Eine Regel der Konsonantenkompatibilität im Ostjakischen”, in Veröffentlichungen der Societas Uralo-Altaica 57.
It is obvious that there were no restrictions on initial palatals though, as shown by e.g. *ńoL ‘nose’, *ńoLt- ‘to knead’, *ńeeL- ‘to swallow’, *ńuuɭəm ‘wound’, *ńaLkïï ‘Siberian fir’, *ńaaL ‘arrow’, *ńeLää ‘four’…

Close vowel reduction in Samoyedic

A well-known feature of the Samoyedic languages is a split development of Proto-Uralic *u. The standard analysis (as first proposed, IIUC, by Janhunen 1981) is that this occurred depending on the original stem type.

  • *u becomes *ə before original 2nd syllable *-a. E.g.
    • *kupsa- > *kəptå- ‘to extinguish’
    • *lupsa > *jəptå ‘dew’
    • *muka > *məkå ‘back’
    • *muna > *mənå ‘egg’
    • *mud₂a- > *məjå ‘earth’
    • *tud₂ka > *təjkå ‘tip’ [1]
  • *u generally remains before original 2nd syllable *-ə. E.g.
    • *kunśə > *kunsə ‘urine’
    • *suksə > *tutə ‘ski’
    • *suxə- > *tu- ‘to row’
    • *tulə > *tuj ‘fire’
    • *ujə- > *uə- ‘to swim’

I have not seen the details of how this happened discussed often, but it all seems straightforward enough. There are at least three reasons to believe that the former of these has been a conditional development, and the latter unconditional.

  1. The most obvious reason is identity: “*u > *u” is not even a change, but the absence of one.
  2. The second is the nature of the conditioning. Words of the 1st set are all united by the presence of *-a. Words of the 2nd do not fare as well, since 2nd syllable *-ə has been regularly lost after original light syllables / single consonants; as in ‘fire’. (I actually suspect that this change needs to be dated as very early, already prior to most individual developments of Samoyedic.) This does have the immediate result of rendering the 1st syllable closed — but this is further disrupted by the loss of medial velar consonants as in ‘ski’, ‘to row’. Moreover, *u > *ə occurs in closed syllables followed by *-a all the same; as in ‘to extinguish’, ‘dew’, ‘tip’.
  3. Finally, phonetically, the development *u > *ə would be well explainable as a simple assimilation development: when followed by the open, [-ATR] vowel *a, *u also becomes [-ATR] (that is: *ʊ), and later drops its labiality to yield *ə.

Observe though that nothing in the previous scenario is dependent on *u being a back vowel. So we should perhaps expect to see something similar also going on with the PU close front vowel *i. And indeed, examples of *i > *ə are known. So far, these have however been explained by rather different means.

  • *śilmä > *səjmä ‘eye’ is explained by Janhunen (1981) and Sammallahti (1988) as dissimilation caused by the following *j.
  • *itä- > *ətä- ‘to appear’, *ipsɜ(-) > *əptə(-) ‘(to) smell’ have been explained as being conditioned by the lack of a word-initial consonant by Aikio (2002).
  • Helimski (1993) lists a number of further cases, most interestingly including *pišä > *pətä ‘gall’. He notes that the conditions of the change seem unclear. [2]

(cf. my Bibliography page for the reference details)

Examining the words above as a group, the possibility of a more economical analysis seems self-evident to me: to unite *u > *ə and *i > *ə as a single “a-umlaut” affecting close vowels. Reduction and lowering of close vowels is, after all, almost always a process that operates symmetrically, without regard for vowel frontness/backness. [3] I also find it phonetically highly unclear how the absence of a word-initial consonant could trigger vowel reduction! Aikio mentions also a third example which clearly must be considered an original *ə-stem: *əm- ‘to suck’ < PU *imə- (cf. Finnish ime- ‘id.’). I wonder if this might be, rather than a regular development, due to the influence of the homonymous PSmy root *əm- ‘to eat’.
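Stated as a single rule, the proposed a-umlaut is simple enough to sketch in a few lines. The code below is only an illustration of the conditioning: it changes the first-syllable vowel and nothing else, so its outputs are intermediate pre-Samoyedic shapes, not the Proto-Samoyedic forms cited above. The vowel inventory and the function name `a_umlaut` are assumptions made for the example.

```python
VOWELS = set("aäeioöuüəå")   # assumed inventory, enough for the forms below
CLOSE = {"i", "u"}
OPEN = {"a", "ä"}

def a_umlaut(form):
    """Reduce a close 1st-syllable vowel to ə when the 2nd-syllable vowel is open."""
    positions = [i for i, ch in enumerate(form) if ch in VOWELS]
    if len(positions) < 2:
        return form
    v1, v2 = positions[0], positions[1]
    if form[v1] in CLOSE and form[v2] in OPEN:
        return form[:v1] + "ə" + form[v1 + 1:]
    return form

# Proto-Uralic forms cited in the text, with the asterisk omitted.
for pu in ["kupsa", "muna", "tulə", "suksə", "pišä", "itä", "śilmä"]:
    print(pu, "->", a_umlaut(pu))
```

The point of the sketch is that one condition (an open vowel in the 2nd syllable) covers both the back-vocalic and the front-vocalic cases, while the *ə-stems ‘fire’ and ‘ski’ pass through unchanged.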

Also ‘to smell’ has a stem vocalism issue: Janhunen & Sammallahti reconstruct this as *ipsi (= *ipsə according to my transcription). But here the cognates do not seem to unambiguously point to an original non-open stem vowel. Several examples are known where a PU open stem vowel has been reduced in Samoyedic; and it even seems to be the case that this is especially frequent after consonant clusters. [4] In Samic, the only other diagnostic branch where reflexes of this root survive, both a noun *(h)ëpsë ‘smell’ < *ipsə, and a verb *(h)ëpsē- ‘to smell’ < *ipsä- are found. Also Mordvinic would normally offer evidence, but the proposed Moksha cognate /opəś/ ‘smell’ is so divergent that I wonder if it has any relation to this PU root at all. Thus the choice of which root shape to consider more original cannot be resolved, and *ipsä remains a viable option. It would even be possible to reconstruct an alternation *ipsə ‘smell’ ~ *ipsä ‘to smell’ already to Proto-Uralic. (In this case I’d have to assume that in early pre-Samoyedic, this lexeme showed an alternation *ipsə ~ *ɪpsä-, which was then levelled to uniformly *ɪpsə(-).)

It is also possible to include here the 1st and 2nd person singular pronouns: PSmy *mən, *tən. My proposed soundlaw *i-ä > *ə would make these fairly regularly comparable to the proto-forms *minä, *tinä indicated by most of the Uralic family: Finnic, Mari (2PS only), Permic, Hungarian (2PS only) and Khanty (1PS only). By contrast Samic and Mordvinic suggest back-vocalic *mun, *tun, but these show no evidence of a former 2nd syllable that could have triggered umlaut. Admittedly, neither do the Samoyedic words, but in them a reduction of PU *-ä to PSmy *-ə could again have applied. There is also at least one precedent for the loss of such a secondarily reduced stem vowel: *puna- >> *pən- ‘to plait’. (Medial *-n- in both this and the pronouns seems likely to be an accidental similarity.) Assuming irregular reduction in function words does not seem like a major problem.

There is another piece of evidence in support of this derivation as well. As has been shown by Helimski (1993), whether any given instance of PSmy *ə derives from earlier *i or *u is still indicated by the declension harmony class, especially in Nganasan, and recently Janhunen (2013) [5] has noted that the personal pronouns indeed align as originally front.

As an aside: if also the Samoyedic pronouns derive from *minä, *tinä, we have the additional nice consequence that the forms *mun, *tun can be considered a common Samic-Mordvinic innovation. This would eliminate the issue of two distinct pronoun sets being reconstructible to Proto-Uralic, yet no trace of a distinction between these having been attested in any Uralic language.

There is also one piece of good counterevidence for the previously assumed conditioning for *i > *ə: *i ‘top’, which seems to derive via earlier *ij from an East Uralic *ilə- ‘up, over’ (cf. Mansi *äl-, Khanty *eeL-; contrast with *wülä- indicated by West Uralic *ülä-, Mari *wü̆l, Permic *vɨl-). Despite both a following *j and a lack of a word-initial consonant, and even a status as a short function word, it has not reduced to **ə-.

So, summing up, I would suggest that the usual development of *i has been exactly parallel to that of *u:

  • *i > *ə before original 2nd syllable *ä.
    • *ipsä- > *əptə- ‘to smell’
    • *itä- > *ətə- ‘to appear’
    • *minä > *mən ‘I’
    • *pišä > *pətä ‘gall’
    • *śilmä > *səjmä ‘eye’
    • *tinä > *tən ‘thou’
  • *i > *i before original 2nd syllable *ə. E.g.
    • *ilə > *i ‘top’
    • *nimə > *nim ‘name’
    • *ńimə- > *ńim- ‘to suck’
    • *pid₁ə > *pir- ‘tall’
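The conditioning summarized above is simple enough to write down as a small check. What follows is only a minimal sketch under my own assumptions: the names `predicted_first_vowel` and `first_two_vowels` are made up for illustration, the rule is extended to *u before *a as per the a-umlaut parallel, and *ü is deliberately left outside the rule. Only the quality of the first-syllable vowel is predicted, not the full Proto-Samoyedic form.

```python
import unicodedata

# Vowel symbols used in the PU reconstructions quoted in this post.
VOWELS = set(unicodedata.normalize("NFC", "aäeiouüəëï"))
OPEN_STEM_VOWELS = {"a", "ä"}      # 2nd-syllable vowels assumed to trigger reduction
CLOSE_CARDINAL_VOWELS = {"i", "u"}  # 1st-syllable vowels assumed to undergo it


def first_two_vowels(pu_form):
    """Return the first- and second-syllable vowels of a reconstruction."""
    form = unicodedata.normalize("NFC", pu_form).lstrip("*#")
    vowels = [ch for ch in form if ch in VOWELS]
    if len(vowels) < 2:
        raise ValueError(f"need at least two vowels: {pu_form!r}")
    return vowels[0], vowels[1]


def predicted_first_vowel(pu_form):
    """Predict the PSmy first-syllable vowel under the proposed a-umlaut.

    Close *i, *u reduce to *ə before an open stem vowel; otherwise the
    vowel is returned unchanged (later mergers are not modelled).
    """
    v1, v2 = first_two_vowels(pu_form)
    if v1 in CLOSE_CARDINAL_VOWELS and v2 in OPEN_STEM_VOWELS:
        return "ə"
    return v1


# Outcomes as listed in the summary above (first-syllable vowel only).
LISTED = {
    "*ipsä-": "ə", "*itä-": "ə", "*minä": "ə",
    "*pišä": "ə", "*śilmä": "ə", "*tinä": "ə",
    "*ilə": "i", "*nimə": "i", "*ńimə-": "i", "*pid₁ə": "i",
}

for form, attested in LISTED.items():
    assert predicted_first_vowel(form) == attested, form
```

The apparent exceptions discussed in the next section (e.g. *piŋə, *kulkə-) would fail this check, which is exactly why they need separate explanations.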

Given that the overall amount of data we have of the development of PU *i in Samoyedic is fairly limited, I feel that being able to root its development in processes that can be confirmed with other data (i.e. the development of *u) is a good step up from explaining its development with a grab-bag of phonetically ill-defined processes based on two or three examples.

Additional cases?

There still remains some evidence for close vowel reduction also in PU *ə-stem roots.

Although I have just noted how phonetically unmotivated sound changes, if based on limited data, might well turn out to be illusions based on accidental patterns in the data, I will venture one hypothesis for an additional condition for reduction: before PU *-ŋ-. There are two candidates for this:

  • *piŋə ‘tooth’ > *pəj ‘stone; flint’?
    Although most of this root’s reflexes have the primary meaning ‘tooth’, the rather specific meaning ‘flint’ can also be found in Finnic, and it may have been present already in Proto-Uralic. In Samoyedic the word seems to have undergone semantic drift, first losing its anatomical meaning, and then becoming the neutral word for ‘stone’ across much of the subfamily. The development *ŋ > *j is unclear, although the same seems to be found in *kuŋə (or *këŋə?) > *kïj ‘moon’.
  • *suŋə > *təŋə ‘summer’?
    Here the retention of *-ə after an open syllable is very odd, but per the West Uralic evidence it is quite clear that an original *ə-stem must be reconstructed; cf. e.g. Finnish suvi ‘summer’, Northern Sami sagŋat ‘to thaw’.

This still does not exhaust the cases of inherited reduced vowels in Samoyedic, though. At least the following can also be noted:

  • PU *kulkə- > PSmy *kəj- ‘to go’?
    This may seem to recapitulate the alleged soundlaw *i > *ə / _j; but there are a number of good counterexamples against assuming this to have been regular also for *u, e.g. *ulkə > *uj ‘pole’, *tulə > *tuj ‘fire’. I suspect that the explanation could rather involve how *kulkə- seems to be an early loan from PIE *kʷelH- (or a pre- or para-IE *kulH-) ‘to turn, to go’. Perhaps the Samoyedic word has been acquired from a slightly different source than its apparent cognates. One form that would fit well here is Tocharian *kɨl-.
  • PU *küńəl(ə) > PSmy *kəńələ ‘tear’?
    Presumably a case of interference from PU *kuńa- > PSmy *kəńə- ‘to close eyes’.
  • PU #pid₂mətä > PSmy *pəjmətä ‘dark(ness)’?
    The PU reconstruction is tentative; the only other cognates are Finnic *pimedä, Udmurt /peĺmɨt/, Komi /pemɨd/. It might be possible to fix things if we were to start from something like *pid₂mä-, and to assume that the adjectival suffix on display here (perhaps the only case where it can be explicitly reconstructed already for PU) has originally been plain agglutinative *-tA, and not root-vowel-replacing *-ətA. In Finnic, quite a few alternations between CVCA noun/verb roots and CVCe-dA adjectivizations are found, but perhaps we should not project this too far back, and instead assume secondary vowel reduction in Finnic at some time-depth. The Permic reflexes would appear more regular in this case as well, as the other examples of the development *i > /e/ seem to be restricted to *ä-stem roots.

Secondary *i?

There are also some Proto-Samoyedic roots with apparent *i whose Uralic etymologies appear to imply original *-ä. I suspect these however have a different origin. In the Samic cognates of these, *i rather than the usual *ë appears. Similarly, the Khanty cognates have *ii rather than the usual *ee.

  • Samic *imē ~ Khanty *iimə ~ PSmy *imɜ ‘grandmother’
  • Samic *cicē ~ Khanty *čiinč ~ Selkup /čičik/ ‘small bird’
  • Samic *ńińčē ~ Selkup /ńipsə/ ‘teat’

Contrast *śilmä > Samic *čëlmē, Khanty *seem ‘eye’; *ipsä- > Samic *ëpsē-, Khanty *eepəɬ- ‘to smell’.

It is unclear where this correspondence could originate: Samic *i-ē and Khanty *ii do not have any known regular origin. There are several words where Khanty *ii appears to derive from older *e though, particularly next to semivowels (Kh *liiɬ ‘breath, soul’ ~ Finnic *lewlü ‘steam’ < PU *lewlə?; Kh *niiŋ ‘girl’ ~ Finnic *nejtej ‘id.’ < PU *nejə-?) which could suggest proto-forms such as *čej(ɜ)nčä, *ejmä here.

It is also the case that a merger of *e and *i has occurred in most of Samoyedic, with the exception of Nganasan, where *e > /ɨ/ (e.g. *wetə > /bɨʔ/ ‘water’). But a second-degree exception to this applies: *e still ends up as Ng. /i/ when before /i/ or /ü/, and this could apply to ‘grandmother’: Ng. /imi/. In Proto-Samic, in turn, *e before *-ä would yield *ea, but the structure *-eaj- appears to be mostly absent (with the exception of *peajvē ‘day’, which is however an irregular reflex of older *päjwä).

Regardless of whether this is on the right track for explaining these words, it should be clear that given these sets of “regularly aberrant” reflexes, positing PU *čičä, *imä cannot be justified (and these comparisons are indeed absent from the stricter PU wordlists of Janhunen and Sammallahti). If so, there seem to be no cases where PU *i before 2nd syllable *ä would be retained as Samoyedic *i.

There is a proposal that might be worth a mention here, though. Tibor Mikola, in the monograph Studien zur Geschichte der samojedischen Sprachen (2004, Studia Uralo-Altaica 45), considers reconstructing distinct *ɪ, *ʊ already for Proto-Uralic, and their merger with *i, *u on the Finno-Ugric side. He seems to have picked this up from the idiosyncratic PU reconstruction of Katz, which I’ve mentioned previously. In principle words like ‘grandmother’ above could then allow contrasting the vocalisms *ɪ-ä and *i-ä. I am skeptical however, since the assumed FU development *ɪ > *i is typologically backwards, and this line of explanation appears to depend on the belief in a special status of Samoyedic within Uralic.

Notes on the non-cardinal close vowels

Anyway, if *u-a and *i-ä > *ə are regular developments in Samoyedic… what about the other close vowels, then?

I still hold that the non-close back unrounded, “eighth” vowel of Proto-Uralic should be reconstructed as mid *ë, not as close *ï. It should then be possible to simply date the emergence of Proto-Samoyedic *ï, in words of the type *mëksa > *mïtə ‘liver’, as later than the reduction of *u and *i. This might carry some consequences, but more on those at another time.

The case of *ü is less clear. This was unambiguously a close vowel in PU. Yet, the scarce Samoyedic examples we have of this vowel in *ä-stems display *i, not *ə:

  • *d₂ümä > Smy *jimä ‘glue’
  • *küčä > Smy *kitä ‘birch bark vessel’ (with irregular *-t-)
  • perhaps: *lüwä- > Smy *jiwä- > Selkup /čü-/ ‘to shoot’ [6]

(NB the loss of rounding is not an issue: also in *ə-stems, *ü usually merged with *i, e.g. *tütkə- > *titə- ‘to open’, *nüd₁ə > *nir ‘handle’.)

Some sort of palatal cheshirization could be considered. Suppose that *ɪ, as resulting after the first stage of a-umlaut, managed to keep its timbre here, thanks to the word-initial *j and *k (the latter may have had a palatal allophone *[c] already in Proto-Samoyedic), and proceeded to climb back to *i? Still, a following *j in *səjmä ‘eye’ does not appear to have the same effect, so I am hesitant.

Alternately, perhaps this can be related to other ways in which *ü appears to not follow the example of *i and *u. Two issues come to mind:

  • In Khanty, as mentioned, the default reflex of *i is tense *ee. *u is reflected as lax *o before original consonant clusters (*suksə > *ɬok ‘ski’), but tense *oo before original single consonants (*muna > *moon ‘egg’). Yet *ü is only ever reflected as lax *ö, never tense *öö.
  • In Permic, at least in open syllables *i-ä is lowered to a secondary *e (which in Komi furthermore may yield /o/). No similar lowering applies to *ü in at least *küsä > *kɨz ‘thick’. OTOH we do have *d₂ümä > *ĺem ‘glue’?
  • I was originally also planning on mentioning some oddities involving *ü in Mansi here, but these have since the inception of this post gotten expanded into a reanalysis that I think solves them.
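To make the Khanty asymmetry concrete, here is a rough sketch of the default reflexes as described in the first point above. The function names are hypothetical, and the cluster test is a simplification of my own: it just counts letters between the first two vowels, ignoring subscript indices such as the one in *d₂.

```python
import unicodedata

# Vowel symbols used in the PU reconstructions quoted in this post.
VOWELS = set(unicodedata.normalize("NFC", "aäeiouüəëï"))


def medial_consonant_count(pu_form):
    """Count consonant letters between the first and second vowel."""
    form = unicodedata.normalize("NFC", pu_form).lstrip("*#")
    positions = [i for i, ch in enumerate(form) if ch in VOWELS]
    if len(positions) < 2:
        raise ValueError(f"need at least two vowels: {pu_form!r}")
    medial = form[positions[0] + 1:positions[1]]
    # Subscript digits and combining marks are not alphabetic, so they are skipped.
    return sum(1 for ch in medial if ch.isalpha())


def khanty_default_reflex(pu_form):
    """Default Proto-Khanty reflex of a PU close first-syllable vowel.

    *i > tense *ee; *u > lax *o before clusters, tense *oo before single
    consonants; *ü > lax *ö only (no tense counterpart).
    """
    form = unicodedata.normalize("NFC", pu_form).lstrip("*#")
    first_vowel = next(ch for ch in form if ch in VOWELS)
    if first_vowel == "i":
        return "ee"
    if first_vowel == "ü":
        return "ö"
    if first_vowel == "u":
        return "o" if medial_consonant_count(form) > 1 else "oo"
    raise ValueError(f"not a close vowel root: {pu_form!r}")


# The two *u examples cited above:
assert khanty_default_reflex("*suksə") == "o"   # cf. Khanty *ɬok ‘ski’
assert khanty_default_reflex("*muna") == "oo"   # cf. Khanty *moon ‘egg’
```

Written out like this, the gap is easy to see: the *u branch has two outcomes and the *ü branch only one.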

This could suggest that *ü was originally “one step behind” PU *i and *u, i.e. some sort of a tenser vowel. An extended system of close vowels as assumed by Mikola, Katz, etc. would have room for such a thing: assume a PU contrast between tense *i *ü, and lax *ɪ *ʊ? But perhaps not. I can think of other explanations for these issues as well. Going into extended detail is not possible here, but in brief:

  • The alternate reconstruction of Khanty *ee, *oo as lax open *ä, *a (as still reflected in the Surgut dialect) may provide a clue to understanding this asymmetry. Assuming in pre-Khanty a symmetrical situation with *e, *ö, *o for PU *i, *ü, *u, we can expect only *e and *o to continue downwards to *ä and *a. For one, roundedness contrasts in open vowels are highly marked; for two, “pressure” from secondary lax *i and *u (from PU *e and *o in roots of the shape *CVCə, and producing lax *e *o in Proto-Khanty) would have provided a motivation for the lowering of these, while there was no secondary lax *ü.
  • The early Permic vowel system, before the lowering of *i, likely had a mid front gap, brought about by the early retraction of PU *e (ultimately to Udmurt /u/, Komi /o/). In this context, a split of *i to *i, *e makes a decent amount of sense. For ‘glue’ I suspect a fronting *ɨ > *i before this, brought about by the palatalized onset consonant.

And so for now, I have no satisfactory account of the hows and whys of the development of *ü in Samoyedic.

[1] Strangely, I have found no indication of anyone having previously proposed this etymology. It is entirely unproblematic both semantically and phonetically. A small hurdle might have been that Janhunen (1977) reconstructs this as *tajkå instead, but his choice of *a and not *ə seems arbitrary. The Northern Samoyedic reflexes are somewhat irregular, while Southern Samoyedic does not distinguish *ə from *a.
[2] Also allegedly reflected in Mordvinic *pižə ‘green’. I suspect that these words derive from some descendant of Indo-Iranian *wiša ‘poison; green’. I have also entertained the idea that PSmy *pətä might be a metathesized reflex of PU *säppä ‘gall’ (the predicted reflex of this in Samoyedic would be *täpä or *täpə), but the II connection seems more promising than a random entire-syllable metathesis.
[3] It is only after the introduction of reduction that the trajectories of front *ɪ and back *ʊ seem likely to diverge. The shift *u > †ʊ > /ʌ/ in most dialects of English is one good example; the shift *i > *ɪ > *ë > /ɑ/ in large parts of the Samic languages is another. But we still also find *i > /ɪ/ in English, and *u > *ʊ > /o/ in Samic. If there are examples where one close cardinal vowel is reduced while another remains, I have not seen them.
[4] E.g. *jupta- ‘to say’ > *jəptə- ‘to count’, *kod₂ka > *kåjkə ‘spirit’, *mëksa > *mïtə ‘liver’, *peksä- > *petə- ‘to beat’. A possible exception to this pattern, though, seems to be consonant clusters ending in a labial consonant: e.g. *äjmä > *äjmä ‘needle’, *kompa > *kåmpå ‘wave’, *ojwa > *åjwå ‘head’, *päjwä > *päjwä ‘sun’, and the above-mentioned *śilmä > *səjmä ‘eye’. Speculating a bit, I wonder if this could be connected to the fact that labial consonants are in Uralic common in the word-initial position and in derivational suffixes, yet word-medial *p is quite rare. Might these words have been in Proto-Uralic not indivisible roots, but instead derivatives and/or compounds? This would allow e.g. comparing the first syllable of *äjmä with PIE *h₂aḱ- ‘sharp’.
[5] Janhunen, Juha (2013). “Personal pronouns in core Altaic”. In: Robbeets, Martine; Cuyckens, Hubert (eds.) Shared Grammaticalization. With special focus on the Transeurasian languages.
[6] This reconstruction might be preferable to the traditional *lexə-: it fits Finnic *löö-, Mari *lüje- and Hungarian equally well, while Komi /lɨj-/ explicitly suggests a close labial vowel *ü or *u.

