The rooting of historical linguistics

Most of the harder problems in the methodology of historical linguistics seem to come from it being a fairly “high-order” discipline, and a relatively isolated one at that.

To an extent, this is true of all humanities. With the levels of computational power currently available to us, it’s not possible to start with a couple of known physical laws and derive exact predictions about human behavior from them. The best we can do along these lines is to establish boundary conditions. And of course, most of these are sufficiently obvious from our daily experience as humans that they sound more banal than profound when spelled out: e.g. language families usually have a distribution limited to the surface of the planet, and they fail to extend up to the stratosphere, or down to the oceanic crust. :ɪ

But the historical angle complicates things. Most of the historical sciences rely heavily on evidence preserved from the past: history itself is based on written sources, and the “auxiliary historical sciences” such as archaeology on other objects preserved from the past.

And yes, historical linguistics also builds on preserved evidence from the past, mainly via philology and epigraphy. But this has only been a small initial inspiration. Most of our historical insights are instead derived from the observation and analysis of attested modern languages, and the application of a general theory of linguistic evolution. This model is, I think, quite alien to all other humanities. (Even plenty of non-historical lines of humanities research seem to remain stuck in a pre-scientific “there are no theories, only paradigms of discourse” mire.) In this sense historical linguistics has much more in common with evolutionary biology, although I suspect that that discipline, too, would not be doing as well as it is without the more direct evidence from an extensive fossil record. [1]

The inevitable implication is that nothing in historical linguistics can be understood without a good grasp of the underlying theory. And yet, it seems to me that many of its premises have not often been even stated aloud. No doubt this is due to how the theoretical foundation seems to have been developed on a need-to-know basis by its users, as the discipline has expanded, not by any separate class of theoreticians. Yes, starting from the Neogrammarians, many of the surface phenomena have been described, from old’uns like “regularity of sound laws” to innumerable newer achievements like “typology of semantic change in body part terminology”… but the nuts and bolts of it, that really “root” historical linguistics to its sociolinguistic foundations, not so much. There has been so much work in cataloguing the “whats” that we have not yet had much success in uncovering the “whys”.

I am not sure if my term “root” is readily understandable, or if there might be a better term available. It seems likely that this could be confused with a discipline’s internal history, at least. Which is not what I mean: I refer here to how various sciences can be ordered by how far removed they are from the basic laws of the universe. The typical example is how all biological processes can be broken down to individual biochemical processes; all biochemical processes to chemical ones; all chemical processes to particle physical processes. The reason that biology looks very different from chemistry, or from particle physics, is that studying the behavior of macroscopic masses of particles requires very different methods from studying 10 or even 1000 of them. A phenomenon such as embryonic development could in principle be modelled in terms of individual protons and electrons, but this would require enormous amounts of effort wasted on reiterating problems like “how does a water molecule hold together” or “what happens to a protein when it encounters a water molecule”, which have already been solved to sufficient precision for us to instead model an embryo as being built from cells that are built from cellular organelles that are built from macromolecules. A biologist (or a geologist, or a cosmologist) is not interested in the whereabouts of individual particles, but rather in their patterns of distribution at a specific scale in space and time.

The same exact principle holds for the humanities. Say, all psychology is at a certain fundamental level about neurons; but in analysing the overall behavior of the brain, built from a hundred billion neurons, the beliefs, feelings, etc. that they encode can (and must) be treated as entities in their own right. And similarly, while the speech of one human can be studied by phonetics, neurolinguistics, and similar disciplines, it again takes different tools to study the speech of a hundred million humans sprinkled across five thousand years. We need concepts such as “isoglosses” and “etymologies” that exist only as generalizations about the idiolects of individual speakers.

Our tools, however, do not seem to decompose easily into insights about smaller and smaller groups. How exactly does sociolinguistic variation in speech end up producing clean and neat sound laws, or patterns of loanword dispersal, or language areas sharing grammatical features? I do not think we have much more than loose guesses about the workings of these processes, so far.

This type of disconnect is, of course, quite common at the biology/humanities interface, and can sometimes be found elsewhere as well (e.g. in the absence of a working theory of quantum gravity). But to see it within a single discipline, linguistics, seems to me like a situation that ought to be resolvable.

This also means that historical-linguistic knowledge rests, to an extent, on questionable ground. If we do not name our implicit starting assumptions, and end up making little effort to justify them on the basis of the more elementary phenomena they emerge from, is there not a risk that our edifice of knowledge stands askew, and ends up being an exercise in the construction of an essentially abstract theory, rather than a real description of the past?

Some philosophers would at this point certainly retort that all historical inquiry, being both unverifiable and unfalsifiable in the absence of a time machine, does not exist for the purpose of creating a real description of the past, but to create compelling stories about it. OK, I say, but some of us happen to consider truth an essential component of what makes a story “compelling”. Moreover… any model of the past will also make predictions about some parts of the present that we have not examined yet, which grants all historical theories a limited degree of falsifiability.

I do not claim to have a dossier of answers to issues of this sort prepared. Perhaps one or two sketches of solutions. But, of course, questions have to be asked before they can be even begun to be answered.

[1] Arguably though one could claim that the majority of our planet’s biodiversity exists at the microscopic level, and that most of biological history must be thus similarly approached via comparative reconstruction. But in my understanding this is a relatively new approach in evolutionary biology; while historical linguistics dove headfirst into reconstruction already back in the 19th century.

Posted in Methodology

Some sunny words

A recent blog post from Christopher Culver brings to my attention an apparent family of Turkic word roots showing irregular variation in form: *künäš ~ *qujaš ‘sun, day, heat’. Aside from the alternation *n ~ *j (for which *ń seems to be a standard explanation), these seem to make up a neat pair of front/back variants.

I am wondering however if this relationship might be illusory, and if there might be an old Uralic loanword in Turkic involved here instead. There are a few Uralic word roots (themselves probably in some sort of an obscured correlative relationship) that seem quite relevant here:

  • *kaja ‘sun, to shine’ (> Finnic *kajasta- ‘to dawn, to shine’, Lule Sami guojijdit ‘to rise (of sun or moon)’, Samoyedic *kåjå ‘sun’, etc.)
  • *kojə ‘dawn’ (> Finnic *koi ‘dawn’, Hungarian hajnal ‘dawn’, Mansi *kuj ‘dawn’, etc.)

Of particular interest is the Hungarian word, which seems to show the exact same “suffixal” elements as Turkic. This even has a formal equivalent in Khanty: *kuuńəɬ´ ‘dawn’ (apparently showing a change *jn > *ń, in neat parallel to the change *jt > *ć that was proposed by Aikio recently [1]), coming closer yet to the Proto-Turkic form.

It’s hard to say though what the dangling element -nal is here. It’s neither an independent word root on its own, nor a regular derivational affix. If I had to speculate, a compound *kojə-n‿alŋV > *kojnal- ‘beginning of sun’ could be assembled… but this seems a bit contrived semantically. Also I am not convinced that Khanty *aaLəŋ ~ Mansi *aaɣəl ‘beginning, end, point’ is an inherited root at all. [2]

And while phonetically the Khanty form in particular seems like a prime loan original, the semantics are a bit off. Is the meaning ‘dawn’ in Hungarian and Khanty perhaps secondary, from earlier ‘sun’ or the like? Or was there instead a shift ‘dawn’ > ‘sun’ in some transmission language along the way?

Some Turkologists, I’m sure, could also see it as an obstacle that this etymology seemingly requires adhering to sigmatism (reconstructing a Proto-Turkic “lateral” *l₂ that later shifts to *š in Common Turkic) over lambdaism (reconstructing PT *š that shifts later to *l in Oghur Turkic). Now, yes, from what evidence I’ve seen, I lean towards the view that sigmatism is the better solution [3]… It is, however, not an entirely inescapable assumption here. Say we instead assumed that early Oghur maintained *l₂ for some time apart from *l (perhaps indeed as a lateral fricative [ɬ]? [4]). Then we could posit an etymological sound substitution to have occurred during propagation to the other Turkic languages: Khanty *kuuńəɬ´ → Oghur #qujal₂ → Common Turkic *qujaš.

Independent loaning to different Turkic varieties might also be chronologically preferable to assuming loaning already to unitary Proto-Turkic. Christopher notes that *qujaš seems to have a kind of northerly-leaning distribution across the Turkic languages… not bad news for an attempted Uralic loan etymology, I’m sure.

[1] Aikio, Ante (2014): Studies in Uralic etymology II: Finnic etymologies. Linguistica Uralica 50:1.
[2] There is, yes, a rather similar word root in Finnic: *alka- ‘to begin’ — but this does not quite correspond regularly to the Ob-Ugric words, esp. on account of the discrepancy between *ŋ and *k. The vowel correspondence Kh *aa ~ Ms *aa is not typical of inherited Uralic vocabulary either.
[3] But note that this does not compel me to take a stance on the similar rhotacism/zetacism debate, nor to consider *l₂ of “Altaic” inheritance.
[4] Which even brings to mind the East Uralic shift *š > *ɬ, rather similar to the shift *š > *l posited by the lambdaist side of the Turkic debate.

Posted in Etymology

Similar Place Avoidance in language history

An interesting paper I found a couple of days ago: Pozdniakov, Konstantin & Segerer, Guillaume (2007). Similar Place Avoidance: A Statistical Universal. In: Linguistic Typology 11:2.

The main thesis is relatively simple: most languages of the world disfavor word roots where the word-initial and word-medial consonants have the same place of articulation; and, more generally, word roots combining two peripheral (labial, velar) or two “central” (dental, alveolar, postalveolar, retroflex, palatal) [1] consonants.

I have also independently discovered this principle some time ago in my exploration of statistical properties of phonotactics in the Uralic languages. Unlike P&S, though, my first reaction was not to assume status as a defining characteristic of Uralic in general. Certainly its occurrence in well-separated branches of the family seems to require its occurrence in Proto-Uralic as well… but who knows how much further back it goes? I do not recall seeing very many word roots shaped anything like √kag- or √bomp- in almost any Eurasian language at all, really. I have had an impression they’d be slightly more common in some Niger-Congo languages, but apparently not. (Seeing what the results are for Japanese might also be interesting; the language seems to be quite rife with words like tatami, tsunami, kami, fuku, fugu. But I am not sure if Internet Japanese™ constitutes a representative sample.)

Some further observations on the topic:

The maintenance of SPA

A point that I did not see covered in the paper is that the maintenance of SPA in languages requires a degree of diachronic stability of consonant POA classes. Now indeed, as a first approximation, while fluctuations between different types of e.g. coronals (ts > tθ > θ > ð > d > l > …) or velars (k > x > ɣ > g > …) are commonplace sound changes, it’s much rarer to see consonant evolutions such as *p >> *d or *d >> *x.

But the boundaries are still not impermeable. Quite a few relatively general sound changes are known across the world [2] that convert consonants from peripherals to centrals, or vice versa:

  • Labial > palatal: e.g. *w > *j in Hebrew
  • Coronal > labial: e.g. *θ, *ð > /f/, /v/ in Latin and the other Italic languages; similarly *t > *θ > /f/ in Rotuman
  • Coronal > velar (or uvular): e.g. *š > *x in Finnic, Spanish, Pashto…; *t > *k in Oceanic languages such as Samoan, Hawai’ian; *r > ʀ/ʁ in continental Western Europe; *ɫ > ʁ in Armenian
  • Velar > palatal: *k, *g > *c, *ɟ > tɕ, dʑ — a frequent change: Satemic IE, Romance, etc.

This raises the question of how the strength of SPA evolves in languages. Changes of the above sort, applied to a language that follows SPA, will necessarily decrease its SPA-compliance. If *š frequently co-occurs with velars, and rarely co-occurs with coronals, then a change *š > *x will introduce a larger number of velar-velar roots than velar-coronal roots. It follows that there must also exist some mechanisms that increase the SPA-compliance of a language.

A naive assumption that P&S summarily dispatch would be sound changes running in the opposite direction: place dissimilation to re-establish SPA, a la (? *kaša >) *kaxa > *taxa. Yet this is not a commonly attested type of change at all (the only example I can think of is *t > *k only when a 2nd *t follows; attested IIRC from one of the Oceanic languages [3]), and it clearly cannot be a relevant factor.

My hypothesis is that lexical loss is not random. Suppose a language had two synonyms /maba/ and /suba/ for expressing a given concept; then over time, as the language splits into descendants, SPA-violating /maba/ would be more likely to be lost than the SPA-compliant /suba/. A motivation for this could be that SPA-violating roots are generally found to be more “childish” or “non-serious” in sound, and that they’d be more likely to go “out of fashion”. (Pop quiz: which of the sets {boob, dude, google}; {duty, goop, boogie}; {duke, good, butte} do you find the funniest-sounding?)
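The distinction this example relies on, between SPA-violating and SPA-compliant roots, can be made mechanical. The following is only a minimal sketch: the `PLACE` table is a hypothetical, heavily simplified inventory chosen for illustration (it is not taken from P&S), and the function names are my own.

```python
# Hypothetical, coarse place-of-articulation table (an assumption for
# illustration only; a real study would need a per-language inventory).
PLACE = {
    "p": "labial", "b": "labial", "m": "labial", "w": "labial",
    "t": "alveolar", "d": "alveolar", "n": "alveolar", "s": "alveolar",
    "l": "alveolar", "r": "alveolar",
    "j": "palatal",
    "k": "velar", "g": "velar", "x": "velar",
}

# Labials and velars are "peripheral"; everything else counts as "central".
PERIPHERAL = {"labial", "velar"}


def place_class(consonant):
    """Return 'peripheral' or 'central' for a consonant in PLACE."""
    return "peripheral" if PLACE[consonant] in PERIPHERAL else "central"


def spa_status(c1, c2):
    """Classify a root by its initial (c1) and medial (c2) consonant."""
    if PLACE[c1] == PLACE[c2]:
        return "violates strict SPA"    # same exact place of articulation
    if place_class(c1) == place_class(c2):
        return "violates general SPA"   # two peripherals or two centrals
    return "compliant"


print("maba:", spa_status("m", "b"))
print("suba:", spa_status("s", "b"))
```

With a table like this, /maba/ comes out as a strict violation (labial + labial) and /suba/ as compliant (central + peripheral), matching the way the two are treated above.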

This is, in principle, a testable proposition. Take for example the interdental > labial shift in Latin. I would predict that PIE roots that display the change *dʰ >> /f/ ~ /b/ are more likely to be lost or marginalized in Latin (both in early Latin and later on in Vulgar Latin) when there is an original labial or velar consonant in the root as well. Or, in the other direction: I would expect to be able to trace the ancestry of words like duke, on average, a longer way back than that of words like dude.

Affricate co-occurrence

P&S further divide the SPA principle into a couple of statements of different strength. The “general” version is that peripherals avoid any other peripherals, and centrals any other centrals; while the “strict” version is the rarity of, especially, word roots with two consonants of the same exact POA. They discover, however, one major divergence from even the latter: the Bantu languages apparently feature a high number of word roots with two palatal consonants. I’d guess this represents an assimilation development of some sort. Perhaps the palatal series represents the merger of former palatalized alveolar and palatalized velar series? This relatively frequent development would easily leverage the apparently universal abundance of TK and KT roots to produce instead an abundance of CC roots.

In Uralic we find no evidence for an especially strong co-occurrence of palatals. However, the postalveolar affricate *č has a strong tendency to “repeat”. There is a remarkable number of old Uralic roots (some of these more, some less secure) such as:

  • *čača- ‘to be born’
  • *čača- ‘to walk’
  • #čEnčä ‘back’ ~ ‘tail’?
  • *čänčä ‘goose, duck’ (from Baltic *džans- < PIE *ǵʰans-)
  • *čëčə ‘duck’ (perhaps also from the above PIE root somehow)
  • #čečə(kä) ‘moment’
  • #či(n)čä ‘little bird’
  • *čoča- ‘to sweep’
  • *čo(n)čə ‘netstring’
  • *čučkə ‘block of wood’

Perhaps a partial explanation would be some sort of consonant assimilation phenomena. At least the 3rd word seems to have involved an assimilation *č-s > *č-č. And a couple of these roots are reflected in Finnic and Samic as if coming from original *ć-č, yet not all, as shown by Finnic *häntä ‘tail’, *hetki ‘moment’, Samic *cōccë ‘netstring’ (provided the Uralic etymologies for these are valid: they all involve some irregularities). And maybe the “dissimilating” roots should hence be similarly reconstructed as dissimilar to begin with.

We could also wonder if this should be taken as evidence for an origin of some cases of *č via palatalization from earlier velars.

…and other reduplications

P&S also find, though, that at least some languages can have a tendency to favor “reduplicated” roots (their example is Wolof), with the exact same consonant in the root-initial and root-medial positions. Obviously in a language with several consonants per POA, this effect will be overshadowed by the numerous other combinations possible — so /b-b/ could end up relatively frequent, but cases like /b-m/, /b-v/, /b-f/, /b-ɓ/ etc. will still remain rare.

From my initial observations, though, this does not hold up in Uralic, where classes like “labials” are frequently limited to only a single obstruent *p, the nasal *m and the glide *w or *v. The Proto-Sami lexicon, [4] for example, contains fewer than two dozen PP roots, and most of them are either of the shape *m-v, *p-v; *v-m, *v-p; or *v-v. There is only one root of the shape *p-m; none of *p-p, *m-m or *m-p.

The occurring cases incidentally can be shown to be in large part secondary innovations. E.g. the 2nd class contains *vāpsē ‘blade of mitten’; *vipsë ‘skein’; *vēpsēs ‘wasp’; *vōmë ‘width’; *vōmē ‘woods’; *vōmā- ‘to notice’; *vōmtë ‘body cavity’; *vōmtē- ‘to sell’; *vōpējē ‘narrow bay’; *vōpērēs ‘three-year-old reindeer bull’; *vōppë ‘father-in-law’; *vōppō- ‘to pluck’; *vōpsë ‘mesh in a fishtrap’; *vōptë ‘hair’. Most of the roots here seem to have involved the PS development *a-, *o- >> *vō-. All the rest involve the cluster *-ps-, though I’m not sure what to make of that fact.

Cluster complications

Another question the paper does not address is how one should analyze heterorganic consonant clusters. Most languages of the world prefer a simple CVCV root shape over CVCCV. The latter type is regardless fairly popular in some languages. E.g. my index of the Proto-Sami lexicon contains about 920 roots with clusters, about 600 without. So are clusters to be counted as “medial consonant preceded by a coda”, or as “medial consonant followed by another medial consonant”? Is a word such as PS *tolkē ‘feather’ more or less SPA-compliant than PS *kōlkë ‘hair’? The second does have a neat alternating POA structure; but both the syllable onsets are velars. Which of these is more relevant?

From a preliminary look, it stands out that the relative frequencies of 1st members of clusters resemble quite closely the relative frequencies of single medial consonants, while the relative frequencies of 2nd members of clusters more closely resemble the relative frequencies of onset consonants. This would seem to suggest that we should indeed be comparing the first two consonants. But the details could fare differently.

Let’s take a sneak peek at velar/velar combinations for example:

  • *kVkV: severely underrepresented (predicted 18, attested 4)
  • *kVŋV: severely underrepresented (predicted 7, attested 1)
  • *kVkCV: underrepresented (predicted 19, attested 12)
  • *kVŋCV: severely underrepresented (predicted 6, attested 2)
  • *kVCkV: underrepresented (predicted 43, attested 31)
  • *kVCŋV: overrepresented (predicted 3, attested 5)

It seems to be here indeed the case that at least word roots like *kōsŋë- ‘to touch’ are patterning as POA-alternating (= not in violation of SPA). But the underrepresentation of *kVCkV does not fit this hypothesis. Though… the data could also be confounded by one of the most frequent -Ck- clusters being the homorganic *ŋk. I’d need to crunch more numbers here to say for sure.
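To put the six velar/velar figures above on a common scale, one option is the ratio of attested to predicted roots. The sketch below only re-uses the counts from the list; the dictionary layout and function name are my own, and a ratio on counts this small is of course a rough indicator rather than a significance test.

```python
# (predicted, attested) counts for each root shape, copied from the list above.
VELAR_VELAR = {
    "kVkV": (18, 4),
    "kVŋV": (7, 1),
    "kVkCV": (19, 12),
    "kVŋCV": (6, 2),
    "kVCkV": (43, 31),
    "kVCŋV": (3, 5),
}


def attested_ratio(predicted, attested):
    """Ratio below 1 means underrepresented, above 1 overrepresented."""
    return attested / predicted


ratios = {
    shape: attested_ratio(predicted, attested)
    for shape, (predicted, attested) in VELAR_VELAR.items()
}

# List the shapes from most underrepresented to most overrepresented.
for shape, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"*{shape}: {ratio:.2f}")
```

On this scale *kVCŋV is the only shape with a ratio above 1, while *kVCkV sits at roughly 0.7, which is the mismatch noted above.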

There’s clearly much to be made of this topic; I am only scratching the surface so far.

[1] They actually use the term “medial”, but I will not, as this seems likely to be confused with “word-medial”.
[2] That is, discounting cases of local assimilation such as np > mp, mt > nt.
[3] I recall Robert Blust covering this topic in his paper __. I seem to have displaced my copy of it, though.
[4] Again, as per Juhani Lehtiranta’s Yhteissaamelainen sanasto (1989/2001).

Posted in Uncategorized

Consonant clusters in Khanty

My previous example of phonotactic combination analysis was on data that was, despite a few kinks, still largely homogenous. But to showcase how it’s important to have a decent basic hypothesis before going into more fine-grained analysis, here’s a look at a rather different dataset. These are the medial consonants and consonant clusters from the inherited Proto-Khanty lexicon, again per Honti’s data (words with cognates elsewhere in Uralic but absent from Mansi are not included).

Some notes about notation etc. though, before I go on.

  • 1st medial consonants (“C₂”) are listed down. Possible 2nd consonants of a consonant cluster (“C₃”) are listed across.
  • I have analyzed PKh *ə as an epenthetic, non-phonemic segment that is inserted in “difficult” consonant clusters in, roughly speaking, stem-final position. E.g. *peLəm ‘lip’ = underlyingly /peLm/. Without this analysis I would be almost comically short on data.
  • *g and *x mark two segments that only contrast in Western Khanty in back-vocalic roots (as /w/ versus /χ/). Honti conflates both as *ɣ. The contrast is not (directly?) recoverable in front-vocalic roots, nor in words that have been retained only in Eastern Khanty, and seems to have been absent from the C₃ position. I have counted ambiguous cases under *g¹.
  • *L and *Ľ are cover symbols for laterals. PKh had a contrast between a fricative *ɬ and an approximant *l, and might have had even a similar contrast among the palatal laterals, but this is not recoverable in the medial position. (By contrast, the retroflex lateral *ɭ was quite certainly an approximant.)

But without further delay, here is what things look like in this part of the word root — sorted by frequency, again:

Proto-Khanty consonant clusters by consonant frequency

Already one look at this table should tell us though that it would be pointless to compare it against what an assumption of random distribution would predict. Not only are there way too many gaps, there are also several strong correlations apparent. Take for example C₂ *ń and C₃ *ć, which are both found almost solely in the cluster *ńć.

So the first step ought to be determining some basic background rules of phonotactics. Here is the same data, now sorted by place of articulation instead:

Proto-Khanty consonant clusters by place of articulation

Several qualitative patterns are clear by now.

  • Almost all of the action goes on in the “edge” cells — those combining peripheral (bilabial/velar) and coronal (dental/alveolar/retroflex/palatal) consonants.
  • Nasal + stop/affricate clusters (highlighted in pink) are easily the most frequent type of homorganic clusters. For bilabials, palatals and velars they are the only attested cases.
  • There is a degree of coronal harmony: dentals/alveolars, retroflexes, and palatals do not combine with one another. [1] For the sibilants, nasals and laterals, this is exceptionless. The rhotic *r and the semivowel *j tolerate some exceptions, perhaps due to how the two lack counterparts at other POAs. One case with *-ćt- is attested, namely *kaćtə- ‘to hit’ — and in Northern Khanty only, actually. This is also one of the clusters that’s demonstrably secondary, as comparison to Mansi *këëćk- indicates that the word is to be segmented as *kać-tə-. Perhaps we can assume that in Proto-Khanty, this cluster still remained impossible.
  • Geminates are uniformly forbidden.

More detailed frequency analysis should probably focus just on the areas that show no obvious restrictions of this kind. And now we can easily pick out a subset of data suited for this:

Coronal + peripheral clusters in Proto-Khanty

Peripheral + coronal clusters in Proto-Khanty

The data’s still a bit scarce, but here the distribution’s at least more randomized. And hence signs of various “minor” historical developments are now able to better stand out. Plus: note that despite my presentation, this is not really two separate datasets — it’s a single, three-dimensional dataset, with cluster order as the 3rd dimension. We can for example note the disproportionally high count of *-x(ə)L- compared to a disproportionally low count of *-L(ə)g-, almost certainly an indication of the regular metathesis of PU *lk and *sk in Khanty.

A full analysis would again be much more work than I am going to just blog out in my free time, though. I have no doubt that this general type of methodology, applied to any one given language, could produce a small monograph’s worth of results…

[1] A result very similar to this has been noted already by Eugene Helimski in 2002: an incompatibility of the dentals *n *t vs. retroflexes *ɳ *č in word-initial vs. word-medial position. See: “Eine Regel der Konsonantenkompatibilität im Ostjakischen”, in Veröffentlichungen der Societas Uralo-Altaica 57.
It is obvious that there were no restrictions on initial palatals though, as shown by e.g. *ńoL ‘nose’, *ńoLt- ‘to knead’, *ńeeL- ‘to swallow’, *ńuuɭəm ‘wound’, *ńaLkïï ‘Siberian fir’, *ńaaL ‘arrow’, *ńeLää ‘four’…

Posted in Methodology

Phonotactics vs. protolanguages

Phonotactic analysis is probably one of the most straightforward tools for statistical etymology. There are others too, but this is an analysis method that will easily bring up a wealth of data that has no real synchronic motivation (arbitrariness of the sign, once again) yet can be assumed to reflect all sorts of historical processes of language development. Usually though in more or less fossilized form, perhaps even quite deeply so.

However, when the object of the analysis is a reconstructed protolanguage, another option also becomes available. This is to take significant quirks as instead suggesting points on which the reconstruction itself could be improved. A reconstruction is not primary data! It is allowed to make argued-for adjustments in just what the reconstruction is in the first place. (Alas, not realizing this is a somewhat common failure mode in studies mixing synchronic analysis methods with reconstructed data.)

For an example of this approach in action, here is a sneak peek at one dataset I am massaging:

OU stats 1

This table shows the co-occurrence of initial consonants and following vowels in the common Ob-Ugric lexicon, as reconstructed by Honti (1982). Since this is for the sake of an example, at this point only some small adjustments in the reconstruction have been added, nothing major. The various non-integer values are due to me splitting most reconstructions that show uncertainty in their reconstruction: e.g. the root listed as *keej-/*kööj- ‘to lek’ has been tabulated as 0.5 *kee-, 0.5 *köö-. An exception to this though is the correspondence type marked by Honti as “uu/ïï” which actually outnumbers several allegedly regular vowel correspondences, and seems to deserve a line of its own.
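The fractional tabulation described here can be sketched in a few lines. This is only an illustration: the data layout and the function name `tabulate_onset_vowel` are my own, and splitting a root evenly among more than two alternatives is an assumption generalized from the 0.5/0.5 example.

```python
from collections import defaultdict


def tabulate_onset_vowel(entries):
    """Count (onset, vowel) pairs over a list of roots.

    Each entry is a list of alternative (onset, vowel) readings of one
    root; an uncertain root is split evenly among its alternatives.
    """
    counts = defaultdict(float)
    for alternatives in entries:
        weight = 1.0 / len(alternatives)  # assumption: an even split
        for onset, vowel in alternatives:
            counts[(onset, vowel)] += weight
    return dict(counts)


entries = [
    [("k", "ee"), ("k", "öö")],  # *keej-/*kööj- 'to lek': 0.5 + 0.5
    [("ɬ", "ää")],               # a root with a single reconstruction
    [("ɬ", "ää")],               # another such root
]

table = tabulate_onset_vowel(entries)
print(table)
```

Row and column totals computed from such a table stay consistent with the number of roots, since every root contributes a total weight of exactly 1.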

“B”, “BB” and “FF” moreover indicate correspondences that are sufficiently irregular that Honti has only dared to report if the data points towards a back or front vowel, and a long or short vowel.

So the question is: might we be able to determine if there is anything odd going on here? For just one example, while roots with zero onset are quite abundant, there seems to be an absence of roots beginning with *o-. But then again, random holes occur elsewhere in the table as well. So is this a sign of something being wrong with the reconstruction, a reflection of some earlier sound law in the development of Ob-Ugric, or perhaps of nothing at all? Hard to say using only qualitative tools.

Forming some simple quantitative predictions from this type of data is however not hard. For a first approximation, say we assumed a fully random distribution of roots, with no interdependences in the occurrence of consonants vs. vowels. In this situation, the expected number of roots beginning with a given *CV- sequence could be calculated from just the total vowel and consonant frequencies. For example *-ää- occurs in 44/724 ≈ 6.1% of the roots; *ɬ- occurs in 53/724 ≈ 7.3% of the roots; their predicted co-occurrence is thus 0.061·0.073 ≈ 0.44% of the roots, i.e. the expectation value of roots beginning with *ɬää- is about 3.2.

Algebraically, the formula for this expectation value comes out as C·V/A, where C is the attested count of the onset, V the attested count of the vowel, and A the number of roots altogether.
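As a quick check of the arithmetic, the formula C·V/A can be written out directly. This is a minimal sketch under the same assumption of fully independent consonants and vowels; the function names are my own.

```python
def expected_count(onset_total, vowel_total, root_total):
    """Expected number of roots with a given *CV- sequence: C * V / A."""
    return onset_total * vowel_total / root_total


def deviation(attested, onset_total, vowel_total, root_total):
    """Attested count minus the expectation value for one table cell."""
    return attested - expected_count(onset_total, vowel_total, root_total)


# *ɬ- occurs in 53 roots, *-ää- in 44 roots, out of 724 roots altogether.
expected = expected_count(53, 44, 724)
print(round(expected, 1))  # 3.2
```

The same two functions can be applied cell by cell to the whole onset-by-vowel table, with positive deviations marking overrepresented sequences and negative ones underrepresented sequences.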

The actual number of attested roots beginning with *ɬää- happens to be indeed 3 (*ɬääpət ‘7’, *ɬäärəɣ ‘ruffe’, *ɬäärɣət ‘hard’). So in this case the prediction is spot on! Many of the other CV combinations seem to work this well too, “off” by about 1 at most. But larger deviations also can be found. Here is the full table of differences between the attested and expectation values, with some color-coding applied:

OU stats 2

As an initial observation, note the gradual accumulation of random holes and peaks: a lesser number of roots are off by about 2, even fewer off by about 3, etc. Also unsurprisingly, bigger deviations are mainly found towards the upper left, where the data is denser.

At this point we could continue quantitative analysis. Making various starting assumptions about expected variance in the vocabulary and then doing a whole bunch of math would probably be able to tell us if the general patterning of the data shows statistically significant deviations or not. But… this seems like a bit too much work. For one, parts of the table would end up having to be recalculated if we were to adjust the underlying reconstruction even just a bit (e.g. by splitting a given proto-vowel in two). And for two, it is not at all obvious what should be our default hypothesis! It is already known that languages tend to prefer some phoneme combinations over others. And yet, AFAIK, a universal typology of this has yet to be developed even qualitatively. Applying detailed rigorous methodology while relying on guesstimated background assumptions would be a waste of effort.

Instead, I think at this point a qualitative human intervention can already tell us how likely it is that there is anything interesting going on here at all. Rather than aiming for assessing every single entry, let’s check out just the lowest-hanging fruit. The 5 most aberrant *CV- sequences in the data are:

  1. *wuu: +9.0
  2. *kuu: +7.4
  3. *ää: +6.9
  4. *mee: +5.2
  5. *kää: -5.0
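
Picking out such a shortlist is mechanical once the deviations exist. A minimal sketch, using only the five values quoted above (the helper name `most_aberrant` is a made-up one):

```python
import heapq

def most_aberrant(deviations, n):
    """Return the n (sequence, deviation) pairs with the largest absolute deviation."""
    return heapq.nlargest(n, deviations.items(), key=lambda item: abs(item[1]))

# The five values listed above, deliberately out of order.
deviations = {"*kää": -5.0, "*mee": 5.2, "*wuu": 9.0, "*ää": 6.9, "*kuu": 7.4}
for sequence, value in most_aberrant(deviations, 3):
    print(f"{sequence}: {value:+.1f}")
```

Ranking by absolute value keeps underrepresented sequences such as *kää- in view alongside the overrepresented ones.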

Since my initial point is to demonstrate that calculating phoneme co-occurrence rates among a proto-language’s lexicon can reveal evidence for adjusting the reconstruction, surely this sort of evidence should be found at this end of the data, if at all.

And indeed, it looks like at least the first case is not an accident. In part it probably reflects the fact that the contrast between *uu- and *wuu- is not very clearly indicated in the data at all. Most Ob-Ugric varieties have lost *w before rounded vowels; and some others like Pelymka Mansi and Kazym Khanty have by contrast introduced an epenthetic *w before some rounded vowels. In other words, we may already suspect that having as many as nine roots “too many” indicates that some of Honti’s *wuu- roots here should be actually reconstructed with plain *uu- instead.

A look at Southern Mansi suggests a few good candidates. These are the words where Honti assumes shortening *uu > *u in Mansi (although this is a change he does not really present any conditioning for):

  • #668 *wuuj- >> SMs oj- ‘to swim’ (~ Pelymka wuj-, Kazym wooś-)
  • #682 *wuulɜ >> SMs olā ‘pole’ (~ Pelymka wula, Kazym wooɭ)
  • #689 *wuunč- >> SMs onš- ‘to run over’ (~ Pelymka wunš-, Kazym wuš-)
  • #708 *wuur >> SMs or ‘edge’ (~ Kazym wur)
  • but: #706 *wuur >> SMs wor ‘possibility, way’ (~ Kazym wur)

It looks like Southern Mansi may actually have maintained a contrast between *w and zero in this environment. And, better yet: Honti also fails to list any examples beginning with (zero onset plus) *uu that would have any potentially incriminating reflexes in Pelymka, Kazym, or other similar dialects. So there seems to be no obstacle to adjusting the reconstructions to *uuj- ‘to swim’, *uulɜ ‘pole’, *uunč- ‘to run over’, *uur ‘edge’. In the case of ‘to swim’ we can even verify this with external evidence! Consider Permic *uj- ‘to swim’. Normally Permic should retain evidence of *w even before rounded vowels (as in Finnish uusi, Hungarian új ~ Komi выль /vɨlʲ/ ‘new’), but no such thing appears here.

Recognizing w-epenthesis also allows cleaning up #701 *wuupɜ ‘older sister’, where *w seems to have again been posited only on the basis of Pelymka wuup. The Khanty reflexes like Tremjugan oopïï, Kazym opi, Obdorsk apii do not support positing *w- at all. Neither does the Proto-Samoyedic cognate *apå. By external evidence, #688 *wuunč ‘nelma’ (a type of salmon) similarly seems to be a case of secondary *w: contrast Proto-Samoyedic *ånčɜ, Komi удж /udž/.

— Moreover the above type of scenario is not the only possible kind of explanation for why a particular sound sequence might be non-randomly overrepresented. A different issue seems to concern the following two words:

  • #659 *wuuč ‘town’
  • #660 *wuučəm ‘weir’

Wider Uralic etymological references generally consider these words to be based on one and the same root. Cognates such as Northern Sami oahci ‘barrier, obstacle, reef’ or Tundra Nenets ва” /waːʔ/ ‘fence’ seem to point to the original basic meaning having been simply ‘fence, obstacle’, from which the two attested meanings are easily derivable. Perhaps also #657: *wuuč- ‘to fish’ is a part of the same bundle. Honti indeed even includes small footnotes in the lexicon commenting on the possible relationship of these three words. It’s not clear to me why he regardless lists them separately.

Altogether at least eight of the roots where Honti reconstructs *wuu- seem to be superfluous in some sense. A pretty good catch for such a simple statistical tool, so far.

I’ve only taken a more casual look at the other top-5 cases, but some instances of *kuu- also might be illusory. More briefly:

  • #229 *kuuďmɜ ‘ashes’: according to a recent proposal from Ante Aikio, this would be a derivative of the root listed by Honti as #227 *kuuď-/*kïïď- ‘to disappear’.
  • #261 *kuulpɜ ‘net’ is generally considered an old derivative of #245 *kuul ‘fish’.

Some less directly apparent phenomena may also have shaped the data. For one, I have here only charted out the co-occurrences of initial consonants + initial vowels. Perhaps a look at medial consonants, or the few stem vowels that are found in the data, would turn up other results. In theory it is even possible that some initial *CV- effects are the secondary product of sound changes involving medials instead. Suppose initial X had some interaction with medial Z, and this then had some interaction with vowel Y; this would already suffice to generate a correlation in some direction between X and Y. Hence, with this mode of analysis, it seems efficient to attack the data from multiple directions. Take a couple of snapshots from different angles, look thru the biggest problems that come up, recalculate the results after any adjustments… and see if this then brings to light any new issues.

Posted in Methodology

Primary vs. secondary *ë

I claimed in my post “Two Lemmata” that the reconstruction of Proto-Uralic *ë rests on quite firm ground by now. Regardless, it is still not too rare to see studies which fail to recognize the idea. [1] Apparently the existence of this proto-vowel cannot yet be considered to have reached the status of general consensus. Why is this?

Assuming that the relevant literature has simply gone unread might be a bit too uncharitable. I believe a better reason for why doubts persist would be that no single unified source discussing the reconstruction of this vowel is available; the information needs to be pieced together from disparate sources. I hope to have previously provided a brief overview, though, and in this post I will explore some additional complications.

Probably one obstacle has been that the evidence for *ë is not trivial. For all other PU vowels, the evidence of Finnic, which has been presumed highly archaic, can generally be taken as direct: PF *a < PU *a, PF *o < PU *o, PF *ü < PU *ü, etc. (with only minor conditional shunts). The PF vowels also generally remain intact in the descendants. And only in Finnic does the contrast *a/*ë seem to be irrecoverably lost. Hence, one necessary precondition for accepting PU *ë is to accept that the Finnic vowel system does too contain innovations, even major ones.

(You’d think there should be no need to explicitly spell out something this basic, but alas, long-outdated ideas about “key languages” have persisted for long in Uralic studies. Better safe than sorry…)

The direct evidence in East Uralic

The best evidence for the reconstruction of *ë comes instead from the quite distinct reflexes in the easternmost branches: Mansi (*ë > *ëë), Khanty (*ë > *ïï) and Samoyedic (*ë > *ë, *ï). Hungarian *ï, though it has in the modern language merged with the front vowels i ~ í, is also quite distinct in its refusal to adhere to vowel harmony. However, in general the vowel systems of these groups have been subject to much innovation, and it takes care to wring out evidence from here.

The single most important observation, I believe, is to look beyond individual details and to note that among all these four branches (i.e. across the East Uralic group in its entirety) the general categories of non-open unrounded back vowels appear cognate to each other. Thus we can find correspondence sets such as the following:

  • H ín (: ina-) ~ Ms *tëën ~ Kh *ɬaan ~ Smy *čën ‘vein, sinew’
  • H nyíl (: nyila-) ~ Ms *ńëël ~ Kh *ńaaL ~ Smy *ńëj ‘arrow’
  • H nyír (: nyira-) ~ Ms *ńëërəɣ ~ Kh *ńaarəɣ ~ Smy *ńër ‘cartilage’
  • Ms *ëët ~ Kh *aapət/ɔɔpət ~ Smy *ëptə ‘hair’
  • H al- ~ Ms *jal- ~ Kh *ïïL ~ Smy *ïlə ‘under’
  • H máj ~ Ms *mëëjt ~ Kh *muukəL ~ Smy *mïtə ‘liver’
  • Ms *tëët ~ Kh *ɬïïkəL ~ Smy *tïtə ‘Swiss pine (Pinus cembra)’
  • Kh *ïïkət- ~ Smy *ïtå- ‘to put up (e.g. a net)’

The alignment is not perfect, but it’s far better than we’d expect to happen randomly. It’d take some odd coincidences to end up with this situation from an original system containing no “ë-type” vowels. [2] I suppose there is the theoretical possibility of proposing *ë to have been an East Uralic innovation, or proposing a set of similar but not identical parallel innovations in the four groups, but I have not seen this done convincingly. [3]

The individual details of course still need examination as well. A 1st-degree correction factor is to note the mainly stem-vowel conditioned split developments in Hungarian (*ë-ə > i ~ í vs. *ë-a > a ~ á) and Khanty (*ë-ə > *aa vs. *ë-a > *ïï). There is very little direct evidence for the original stem vowels in any of the Ugric languages, and the Samoyedic evidence has its limitations as well, but their western relatives help here: cf. e.g. Finnish suoni, nuoli, hapsi vs. ala-, maksa, ahtaa. You may also notice that the H and Kh splits run in largely opposite directions, and indeed I do not think any examples are known where H í or i would correspond to Kh *ïï. There are moreover also some apparent exception cases with *ë-a > *aa in Khanty, though, so the exact analysis of this split may require further fine-tuning.

Secondary *ë in Hungarian and Mansi

As 2nd-degree corrections, it also seems to be the case that East Uralic *ë-type vowels can regardless in some cases represent conditional developments from different PU vowels altogether.

One prominent source of secondary *ë is cheshirization in Mansi. In what seems likely to be a late change, expected Proto-Mansi *oo followed by a velar consonant develops to *ëë followed by a labialized velar. Typical examples include *čaŋa- > *čooŋk- > *čëëŋkʷ- ‘to hit’; *ńoxə-lə- > *ńooɣl- > *ńëëwl- ‘to follow’. (Contrast Samoyedic *čåŋå-, *ńo-.) This is a fairly self-evident change on account of being one of the only regular sources of labiovelars in Mansi (together with similar effects triggered by other labial vowels). It has previously even inspired claims that perhaps all cases of *ëë in Mansi are similarly secondary, as in Erkki Itkonen’s mid-1900s model of Finno-Ugric vocalism. [4] However, the other cases resist explanation by similarly simple conditioning. “Redistributionary” splits, which do not lead to the creation of any new phonemes or even allophones, do happen! Being able to condition the appearance of a sound in one environment is not sufficient evidence for concluding that its appearance in other positions would therefore have to be conditioned by something as well.

And indeed, we can even find contrasts (near-minimal pairs) between primary and secondary *ëë in Mansi. Consider e.g. *këŋkə- ‘to climb’ > Ms *këëŋk-; but *aŋa- ‘to open’ > Ms *ëëŋkʷ- ‘to undress’. As the shift *oo > *ëë / _K has normally left a trace in the form of the labialization of the following velar consonant, roots like the first could only be accommodated into the system by abandoning regularity and switching to a much weaker model running on “sporadic” sound changes.
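
The conditioned change can be sketched as a single rewrite rule. This is only an illustration: the function name is made up, and the mapping of plain velars to their labialized outcomes (k > kʷ, ɣ > w) is an assumption read off the two examples given above, so other velars are not covered.

```python
import re

# Assumed outcomes of labialization, based only on the examples in the text.
LABIALIZED = {"k": "kʷ", "ɣ": "w"}

def cheshirize(form):
    """Apply *oo > *ëë before a velar (optionally preceded by ŋ), labializing the velar."""
    return re.sub(
        r"oo(ŋ?)([kɣ])",
        lambda m: "ëë" + m.group(1) + LABIALIZED[m.group(2)],
        form,
    )

print(cheshirize("čooŋk-"))  # secondary *ëë, with labialized velar
print(cheshirize("ńooɣl-"))  # secondary *ëë, with *w from *ɣ
print(cheshirize("këëŋk-"))  # primary *ëë: the rule has nothing to apply to
```

The third form comes out unchanged, with a plain velar, which is exactly the trace that separates primary from secondary *ëë.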


Another sound law responsible for secondary *ë-type vowels also seems to be identifiable. This is a type of “illabiality assimilation”: *o > *ë / _jC.

This development has long been recognized for Hungarian. E.g.:

  • *kojə-ma > *kojmV > *këjmV > hím ‘male’ (cf. Skolt Sami kuõjj ‘husband’ < PS *kōjë)
  • *pojə-ka > *pojɣV > *pëjɣV(-w) > fiú ‘boy’ (cf. Finnish poika)
  • *kojɜ-ta- > *kojðV- > *këjðV- > hízik (hízo-) ‘to become fat’ (cf. Mordvinic *kuja ‘fat’)
  • *tojə-ntV > *tojdV > *tëjdV(-w) > tidó ‘birch bark’ (cf. unsuffixed Udmurt /tuj/, Komi /toj/) [5]

The first two cases are well-known and relatively clear. I am not sure if the latter two have been previously noted, but they seem to work equally well. A fifth case might additionally be *kojə-ra > *kojrV > *këjrV > here ‘drone; testicle’ (cf. Finnish koiras ‘male’) — though it is unclear why we get here a mid vowel e, instead of the expected i ~ í. [6] It’s also interesting how hím (hime-) and here follow vowel harmony; yet the shift *k- > h- still indicates them descending from back-vocalic originals.

It is also fairly clear that the change only occurred in closed syllables: this is shown by e.g. *kojɜ > háj ‘fat’, *pojə > faj ‘species’ (though the semantic development here seems questionable), *śojə > zaj ‘noise’.
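
The rule and its closed-syllable restriction can be written out as a small sketch. It operates on the intermediate (post-syncope) forms given above; the function name and the list of vowel symbols are assumptions made for the example.

```python
import re

# Vowel symbols used in the reconstructions, including the cover symbols ɜ and V.
VOWELS = "aeiouäöüëïəɜV"

def illabial_assimilation(form):
    """*o > *ë before j + consonant, i.e. only when j closes the syllable."""
    return re.sub(rf"o(?=j[^{VOWELS}\-\s])", "ë", form)

for form in ["kojmV", "pojɣV", "kojðV-", "tojdV", "kojɜ", "pojə"]:
    print(form, ">", illabial_assimilation(form))
```

The first four forms have j before a consonant and undergo the change; the last two have j before a vowel and keep *o.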

Interestingly there seems to be evidence of this change having extended to Mansi as well. At least three promising and two potential examples can be found:

  • *kojə-ra > *kojrV > *këjrV > *këër ‘male animal’ (cf. Fi. koiras)
  • *kojwV-lV > *kojlV > *këjlV > *këëĺ ‘birch’ (cf. Fi. koivu)
  • *soja-tV > *sojtV > *sëjtV > *tëëjt ‘sleeve’ (cf. Skolt Sami suäjj < PS *soajē; unsuffixed *soja > ujj in Hungarian)
  • ? *poskə > *poɣɬV > *pojɬV > *pëjɬV > *pëëjt ‘cheek’ (cf. Fi. poski)
  • ? *ńojta > *ńëjtV > *ńëëjt > *ńääjt ‘shaman’ (cf. Fi. noita)

The 4th has a kind of a chicken-and-egg problem: after primary *ë there is some evidence for a shift *ɣ > *j (e.g. *mëksa > *mëëjt ‘liver’; *wëlka- > *wëëɣl- ~ *wajt- ‘to rise’) [7], but we obviously cannot use both *ëë to condition the *j and *j to condition the *ëë. A possible ad hoc solution would be to reconstruct something like #pojsəkə, but let’s not.

The 5th requires a shift from *ëë to *ää, seemingly due to the influence of two flanking palatal/ized consonants. It is not clear though if this should be dated to the Proto-Mansi level, or perhaps later. Northern Mansi /ńaajt/ and Southern Mansi /näjt/ could actually regularly reflect PMs *ńëëjt as well: the former thru the regular lowering *ëë > *aa, the latter thru the regular fronting *ëë > *ee adjacent to palatalized consonants + vowel shortening to /ä/. For these changes a perfect parallel is PMs *ńëëraa > *ńeerää > SMs /ńärää/ ‘legwear’; [8] a word not of Uralic inheritance, but here the regular back vowel is still found in Eastern Mansi /ńëërə/, Northern Mansi /ńaara/. It is only the Eastern and Western reflexes of ‘shaman’ that point to older *ää specifically.

It’s moreover possible that the 2nd case actually indicates instead a fairly similar change: *o > *ë / _ĺ. In this light two further interesting words are PU *śod₁ka > *soĺɣV > ? *sëĺɣV > Ms *sëëĺ ‘goldeneye’ (cf. Finnish sotka); and Ms *këëĺt- ‘to peel (e.g. hemp)’, which has been compared to Mari *kŭðaša-, Komi /kuĺ-/, Udmurt /kɨĺ-/ ‘to undress’, and behind which a PU root *kod₂V- could be reconstructed. [9]

There is no clear evidence on how *-od₂- is reflected in Hungarian, as this has not been a frequent sound sequence. However, one old lexical comparison (that the UEW rejects) might be rehabilitable if we assumed that this change also occurred in Hungarian: *śod₂a ‘war, fight’ (cf. Finnish sota ‘war’; Mari *šuðala- ‘to scold’) > *śod₂a-nta- > *soĺdV- > *sëĺdV- > szid ‘to scold’? A cluster simplification *ĺd (? > *ɟd) > *d would also have to be assumed though.

However, even though these changes are highly similar, there is a strange complication that seems to preclude an analysis as a common Hungarian-Mansi innovation. In most words where Hungarian points to this kind of a secondary *ë, the Mansi development differs — we see a loss of *-j- instead:

  • *kojə-ma >> *kum ‘man’
  • *kojə-ta- >> *kaat- ‘to become fat’
  • *pojə-ka >> *piw ~ NMs /piɣ/ ‘boy’
  • *tojə-ntV >> NMs /toont/ ‘birch bark’

At least the 2nd and 3rd of these are clearly irregular: *-jt- is a perfectly valid consonant cluster in Mansi (cf. ‘sleeve’ and ‘shaman’ above), and there are no parallels for a vowel development from *o (or for that matter, any other back vowel) to Ms *i. The 1st brings to mind the developments *kojə > *kuj ‘male’, *śojə > *suj ‘sound’. Was ‘man’ perhaps derived in Mansi from a vowel-stem variant *kojəma > *kujəmV > *kujm?

Perhaps it is relevant that the irregular loss of *-j- in these words extends also to Khanty: *kaatLə- ‘to become fat’, *pak ‘son’, *tontəɣ ‘birch bark’. A fourth example of this is also known, the word for ‘louse’: Ms *tääkəm, Kh *teeɣtəm (also Hungarian tetű); contrasting with Finnic *täi, Udmurt /tej/, Komi /toj/. [10] We could perhaps suppose a loss of *j before a consonant cluster to explain the last two… Though *-ktV is not really a typical Uralic noun formant, and so I also wonder if the Ugric words for ‘louse’ are not perhaps instead somehow related to the quite similar root *tikte found in Tungusic.


In Mansi, further examples of apparent secondary *ë can still be found as well. The residue includes e.g. Fi. os-ta- ‘to buy’ ~ Ms *wëëtaa ‘ware’; Fi. otta- ‘to take’ ~ Ms *wëët- ‘to pluck’. [11] Itkonen in his critique has claimed that *ëë would be even the most frequent correspondence of West Uralic *o, and this seems to still hold up pretty well even once we remove the words showing Finnic *oo (< *a/*ë via Lehtinen’s Law) from the count. It might still be possible that there has indeed been a default development *o > *ë in Mansi, only one bled by several conditional developments. Regardless: this type of secondary *ë must still be distinguished from primary *ë, which is instead normally reflected as *a in West Uralic, and is further supported by the Samoyedic evidence.

[1] For just one example, no mention of this result appears in what I believe is the newest overview of Hungarian historical phonology available: the fifty-odd page appendix in Róna-Tas, András & Berta, Árpád (2011): West Old Turkic: Turkic Loanwords in Hungarian. Wiesbaden: Harrassowitz.
[2] This can be contrasted with the Western end of the family. “Ë-type” vowels are not at all unknown here either. However, these show no relation to each other. E.g. Ter Sami has the vowels /ï/ and /ïë/, from Proto-Samic *ō and *oa < PU *a and *o. Skolt Sami has õ [ɘ] and â [ɜ] plus the long versions, under various conditions from PS *ë < PU *i, *ü, *e-ə. And the various languages of the Southern Finnic areal have õ [ɤ ~ ɨ], mostly from *e, though in some cases from *o.
[3] At least Reshetnikov & Zhivlov (2011; see Bibliography) have attempted an analysis to this effect, but they do not analyze Hungarian or Khanty, and they exclude some material previously reconstructed with original *o that turns out to be quite relevant. A recent follow-up in Zhivlov (2014) has abandoned the idea.
[4] He has presented some detailed critique against the reconstruction of *ë (“Vokaaliston kysymyksiä”, 1988, Virittäjä 92 pp. 325–329), though it seems this never led to much further discussion of the matter, and after Itkonen’s death in 1992 no one else seems to have had much interest in defending his system of vowel reconstruction.
[5] An alternate reconstruction *tejɜ- would also work for the 1st syllable vocalism, but this would predict a vowel-harmony-compliant **tidVw > ˣtüdő in Hungarian.
[6] It would be possible to hypothesize e.g. that inherited *ë had already been split to Old Hungarian *i vs. *a at this date, and that *oj first yielded not *ëj, but rather *ej, which was later assimilated to *i; and that in ‘drone’, *j was then lost early, leaving a mid vowel. The Mansi evidence seems to support an earlier shift specifically via *ë, though.
[7] There are other words as well with a more limited distribution; cf. Honti 1982: 29–30. These words mostly feature an alternation between a base form with *-ëëɣ- and an oblique stem with *-aj-. I would assume that this *j was later generalized to the nominative in the body part terms ‘liver’ and ‘cheek’, which will only rarely occur as subjects.
[8] On a slightly off-topic note, I am not sure if the Southern Mansi long open stem vowels should be taken as original. They don’t seem to contrast with the corresponding short full vowels, and indeed, they correspond to short stem vowels in the other Mansi dialects. They also regularly condition shortening of 1st syllable vowels. I suspect some sort of a prosodic effect here: e.g. ˈV₁-V₂ > V₁-ˈV₂ when V₂ was a full vowel, followed by lengthening of the newly stressed V₂, and if applicable, shortening of the newly unstressed V₁.
[9] The shift *u > /ɨ/ seems to be regular in Udmurt before coda /ĺ/. Other examples include *kad₂a- > PP *koĺ- > *kuĺ- > kɨĺ- ‘to stay’ (cf. Komi koĺ-); *kod₂ka > PP *kuĺ > kɨĺ ‘disease, evil spirit’ (cf. Komi kuĺ); *neljä > PP *ńoĺ > *ńuĺ > ńɨĺ ‘4’ (cf. Komi ńoĺ). Contrast though retention before intervocalic /ĺ/ in muĺɨ ‘berry’, tuĺɨm ‘topmost yearly growth of tree’.
[10] Mari *ti is ambiguous: this could also derive from e.g. *täkV or *tikV. Samic *tikē is though probably an unrelated loan from Germanic (or perhaps from the same pre-Indo-European source as the Germanic words).
[11] These two might suggest a dissimilation *wo- >> *wëë- at first glance, but a counterexample is *woča > Fi. ota-va ‘fish trap’ ~ Ms *wooš ‘weir; fence; city’.

Posted in Reconstruction

Etymologically opaque Votic words

For later reference, here’s a collection of etymologically opaque (to me) Eastern Votic words harvested from my new dictionary. I will not attempt any detailed analysis yet. (Presumably some investigation into Russian, Ingrian, Estonian, maybe even Latvian & German could turn up known cognates for many of these.)

  • aimo ‘carbon monoxide’
  • alëtsë ‘mitten’
  • hilkeä ‘ugly’ — if this is not a hypercorrect cognate of Finnish ilkeä ‘evil’.
  • hulkkuag ‘to travel’
  • hülpeä ‘disobedient’
  • ikolookka ‘rainbow’ — a compound based on lookka ‘bow, curve’, but the 1st element is unclear.
  • jahsaag ‘to take off shoes’ — does not seem related to Finnish jaksaa ‘to have energy for’.
  • kaaliag ‘to lick’
  • kaputta ‘sock’
  • kineri ‘melted fat’
  • koltši ‘old-fashioned ladle’
  • kosma ‘hair’
  • lainatag ‘to swallow’ — does not seem related to Finnish lainata ‘to borrow’.
  • lautta ‘cowshed’ — does not seem related to Finnish lautta ‘raft’.
  • liblo ‘oat awn’
  • linnaasëd ‘malt’
  • lohko ‘soup’
  • lühtši : lühdže- ‘pail’
  • läntü ‘milk’
  • mauttši ‘intestine’
  • naka ‘cask spigot’
  • nakliska ‘some part in a sleigh’ (“the informant is unable to explain what exactly”)
  • nëikko ‘rockable cradle’
  • nättšelikko ‘burdock’
  • nätši ‘uncooked (of bread)’
  • nättü ‘rag’
  • ootava ‘cheap’
  • pallo ‘pigeon’
  • pelssimed ‘loom’
  • peltta ‘leftovers of threshing’
  • pihta ‘shoulder’
  • pilpa ‘dandruff’
  • pärähmä ‘fathom, armful’
  • raaka ‘twig’
  • ramitsaag ‘to limp’
  • ratiz ‘granary’
  • rehnüüz ‘entrance hall’
  • rehtilä ‘griddle’
  • ringuttaag ‘to stretch’
  • ripa ‘footwraps’
  • ripila ‘fireplace poker’
  • rooppa ‘porridge’
  • rootšiag ‘to dig, to rummage’
  • ruttaag ‘to hurry’
  • śalko ‘foal’
  • servä ‘edge’
  • sippelikko ‘ant’
  • sisava ‘nightingale’
  • sultsiag ‘to wash’
  • surmukaz ‘relative’ — probably not derived from surma ‘death’?
  • säblä ‘kitchen hook’
  • šitinka ‘bristle’
  • šlotta ‘slush’
  • taari ‘ale’
  • tahtši : tahdžë- (!) ‘chaff’
  • tauttaag ‘to take’
  • tiheh ‘mosquito’
  • turvaz : turpaa- ‘ladder’
  • tuutikko ‘washbundle’
  • türü ‘food comprising breadcrumbs mixed with milk or water’
  • tšiutarë ‘coldroom’
  • tšiutto ‘shirt’
  • tšäppeä ‘beautiful’
  • uhër ‘auger’
  • unka ‘wooden cup’
  • upa ‘bean’ = Est. uba.
  • ursi : urtë- ‘bed curtain’
  • vaattaag ‘to look’
  • valo ‘dung’
  • varo ‘hoop’
  • veelatag ‘to soak’ — compound with vete- : vee- ‘water’?
  • vokki ‘spindle’
  • väitšiäg ‘to call’
  • ördžähtäässäg ‘to wake’
Posted in Etymology

A potential Turkic-Yukaghir loanword

A project I am working on, on and off, is compiling lexical parallels that have been proposed in connection to various proposed external relationships of Uralic. Occasionally this kind of work turns up nice new etymological insights.

One of the best-retained (and also one of the more specific) verbs of motion reconstructible for Proto-Uralic is *kälä- ‘to wade’: reflected in e.g. Northern Sami gállit ‘to wade’, Finnish kahlata ‘to wade’ (an old loan from Samic), and Hungarian kel ‘to rise’. (The meaning ‘to rise’ is found also in Mansi and Khanty; the latter also has ‘to step up on land’.) This has been compared with the Yukaghir verb *kel- ‘to come’. The pairing is phonetically OK, but semantically it does not seem impressive. It might be acceptable if a relationship between Uralic and Yukaghir were already established, but it offers hardly any evidence for a relationship in the first place.

Interestingly enough, the same Uralic verb has also been compared with Turkic *gel- ‘to come’ — with the exact same semantics and an equally compatible phonetic shape! (E.g. already Björn Collinder in Fenno-Ugric Vocabulary, 1955/1977, reports both comparisons.) Probably the first step here should be to analyze the Yukaghir word as a loan from Siberian Turkic, and worry about any possible Uralic relationships later.

I would predict that pitting the Uralo-Yukaghir and Ural-Altaic hypotheses against each other may turn up further cases like this where a straightforward loan etymology is available. It’s already been noted by Rédei in his “Zu den uralisch-jukagirischen Sprachkontakten” (1999, in FUF 55) that many of the Uralic-Yukaghir lexical parallels extend to some of the “Altaic” languages as well…

Posted in Etymology

Statistical etymology: A Votic example

Last Friday I picked up a dictionary of the Mahu dialect of Eastern Votic (Castreanianumin toimitteita 27, 1986), based on Lauri Kettunen’s collections from about a hundred years ago. [1]

This is not a particularly huge book, with only about 150 pages of lexical data, set in a relatively large monotype font, too. It probably won’t be of much use if one wished to e.g. translate Firefox into Votic. Its usability as a tourist dictionary might be limited as well (even if we ignore the sad fact that Votic is hard moribund, with only some dozens of speakers left). But it seems like a good reference for a linguist wishing to make some contact with the language. Or: a handy unit of data for a linguist wishing to understand the lexical structure of languages.

The lexicons of natural languages are not random in their makeup. Phonemes have differing frequencies of occurrence in different positions of words; and different tendencies of combining with each other. And although one can certainly find linguists who will attempt to offer explanations in terms of elaborate synchronic phonological constraints and preferences, I find this a fundamentally flawed approach. [2] Much more often, any patterns evident in the lexicon are best understood as the fossilized results of historical processes: sound changes, loanword strata and evolving standards of sound-symbolic conventions. The study of a language’s lexicon even at a single point in time will likely turn up insights into its history.

For this type of analysis, this Votic dictionary actually seems like a rather good sample size. The lexicon of any major literary language would be both overwhelming in size (possibly thousands of pages); as well as swamped with recent cultural loanwords (if you happen to find a word shaped approx. like /banana/ or /platinum/ in a given language, this will not tell you much about its prehistory). Neither of these problems is apparent here, and it’s possible to focus on the big picture without getting stuck on data wrangling. On the other end, a simpler list yet of say 100 words, whether artificially truncated or recorded in passing in 1820 from some now-extinct language, would not allow for many statistically significant conclusions at all.


A simple starter example: the Finnic languages have, originally, not contrasted voicing in obstruents (as was the case already in Proto-Uralic). This situation still remains in place in Estonian, Northern Karelian, and dialects of Finnish. Votic, however, sits on the side of the siblings to have fully embraced voicing, and contrasts voiced and voiceless versions of all obstruent consonants: /p t tš k f s š/ ≠ /b d dž g v z ž/. Suppose we were to hand a copy of this dictionary to a linguist who’s never worked with Finnic before. Will they be able to uncover this older constraint?

The answer seems likely to be “yes”. Only minor etymological analysis is required, and the dictionary itself even provides it. The lexemes in the dictionary are glossed in both Russian and Finnish, the two major contact languages of Votic. Additionally, several words identifiable as recent Russian loans are indeed so marked. This allows an initial separation of the lexicon into two mostly disjoint layers: those of Finnic vs. Russian background. (Though of course Finnish has some Russian loanwords as well, and small amounts of words whose origin is not immediately obvious can also be found.)

A look at words beginning with voiced obstruents other than /v/, as well as words beginning with /f/ shows that they, as a rule, belong in the Russian layer. This is a small set to begin with, and after this cleanup, no more than seven counterexamples remain:

  • balalaittaag ‘to gossip’
  • bëëg ‘isn’t’
  • borissag ‘to bubble’
  • bulissag ‘to bubble’
  • börö ‘ironing board’
  • däädi ‘some relative’
  • filissaag ‘to whistle’

So we have four onomatopoetic verbs, one unstressed particle, one nursery word, and one fully legit content word. This is not sufficient evidence to postulate the voicing contrast to be original in the initial position, not when evidently inherited words beginning with /p t tš k s v/ number multiple hundreds altogether. [3]
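
The filtering step described here is easy to mechanize. The following is a minimal sketch under stated assumptions: the function name and the `marked_loans` parameter are invented for the example, the sample consists of a few Votic words cited from the same dictionary, and `marked_loans` is left empty because the dictionary's own loanword markings are not reproduced here.

```python
# Initial segments to flag: voiced obstruents other than /v/, plus /f/.
# ("dž" is already covered by the prefix "d".)
FLAGGED_INITIALS = ("b", "d", "g", "z", "ž", "f")

def unexplained_initials(words, marked_loans=frozenset()):
    """Return words with a flagged initial that are not marked as Russian loans."""
    return [
        word for word in words
        if word.startswith(FLAGGED_INITIALS) and word not in marked_loans
    ]

sample = ["bëëg", "börö", "däädi", "filissaag", "pihta", "kosma", "taari", "valo"]
print(unexplained_initials(sample))
```

With the full dictionary as input and its marked Russian loans passed in as `marked_loans`, whatever remains is the residue that needs a word-by-word look.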

A more detailed examination would find that medial voiced consonants other than /v/ can similarly be shown to be secondary: they occur as the consonant gradation alternants of the voiceless ones. Exceptions, as a rule, again occur only in Russian loans and probably some onomatopoeia. The full details would be more difficult to dig up though, so I am leaving this as an exercise for the interested reader. ;)

[1] In case anyone else is interested, some overflow stock of these from dunno where is still up for grabs at the University of Helsinki’s Dept. of Finno-Ugric Studies (Metsätalo/Unioninkatu 40, 4th floor).
[2] This may not be an entirely fair comparison, but… I have in mind the image of a “generative geologist” attempting to locate physical constraints present in gneiss or sediment that force its minerals to hold a macroscopically banded rather than homogenous structure.
[3] I will not dwell on /š/, also mainly a loanword phoneme.

Posted in Methodology

Interplay of minor soundlaws: Samic glide clusters

Shifting and widening my scope a little, here’s a look into the history of two consonant clusters across the Samic languages as a whole.

The two-glide cluster *-jv- is a simple place to start. The development of this is straightforward: this is retained essentially intact everywhere across all Sami varieties. (If you want to have a look for yourself, I am including copious links to the Álgu database in this post.) Possibly the coda *j may have been vocalized into the 2nd component of a diphthong/triphthong, but this is basically trivial.

A small further complication still comes up in Southern Sami. Two words have here a seemingly irregular *-jj-: *oajvē > åejjie “head, end”; and *peajvē > biejjie “sun; day”.

Both of these happen to be inherited Uralic words, with cognates stretching all the way to Samoyedic. So my first reflex was to go “a ha! does this mean that the words showing /jv/ are therefore newer loanwords?” The answer is “no”, though: at least *koajvō- > gåajvodh “to dig” is of equally ancient pedigree. But I think I can dial this hypothesis back a little. Perhaps the shift *jv > *jj occurred due to the following front vowel *-ē (in Southern Sami characteristically diphthongized to /ie/ even in the 2nd syllable). This seems phonetically plausible & drops the number of counterexamples from half a dozen to one: *vājvē > vaejvie “pain”. This last word is in turn a known Finnish loanword, which may have indeed diffused into Southern Sami at a late date.

This idea seems to be preliminarily further supported by an interesting derivative of “head”: åajvadidh “to advise”. My chops in SS historical morphology are insufficient to present an implicit PS reconstruction, but we can clearly see here at least a retained stem vowel *ā, a regular feature before 3rd syllable *ë; in other positions this was further raised to *ē already in PS. And before this lower vowel, *-jv- survives after all.


Now let’s consider the opposite PS cluster *-vj-. This turns out to have had a much more complicated history.

Three Sami varieties have completely regular development. Lule Sami and Ter Sami have in all involved words metathesized this cluster, merging it with *-jv-. Inari Sami has always retained /vj/. Northern Sami also might belong here, depending on who you ask: the Álgu database claims /jv/ in a single word *sāvjë > sájva “isolated lake”, while my copy of Yhteissaamelainen sanasto presents sawˈjâ (= equivalent to sávja in the current NS orthography). I would guess that there are dialectal differences involved? FWIW, Sammallahti in The Saami Languages claims that at least the Torne Sami dialect group “originally” belonged with Lule Sami rather than Northern Sami. [1]

In a couple of other varieties, it is also possible to state a mostly applicable rule. Pite Sami aligns with its sibling Lule in having -jv- everywhere except in *jēvjë > jievja “white reindeer”; while Skolt and Kildin Sami align with Inari Sami in having -vj- everywhere except in *ćōvjë > Sk čuõivâk, K čuəivex “grey reindeer”. Probably these sorts of exceptions again represent loaning from neighboring dialects. [2]

Southern Sami again shows a few more complications; as does the neighboring Ume Sami. Covering SS first, metathesized /jv/ occurs in two words: *tāvjā > daajvaj “often”, *sāvjë > saajve “gnome”. Unmetathesized /vj/ is found in three: *jēvjë > joevje “light grey reindeer”, *jēvjë- > joevjeme “beard moss” (don’t ask me what the oe is doing in these), *vōvjē > vuevjie “wedge”. Lastly, an assimilated /jj/ is found in *ćoavjē > tjåejjie “stomach”. This appears to confirm the assimilation rule I proposed in the 1st section: v > j / j_ie. Provided that we assume the metathesis *vj > *jv to have occurred before this…
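The ordering dependency here is a feeding relationship, which is easy to check mechanically. Below is a minimal sketch in Python, assuming that both changes can be modeled as plain string rewrites on the Proto-Samic forms cited above. The function names are made up for illustration, and the metathesis is applied unconditionally, even though in Southern Sami it only shows up in some of the words.

```python
import re

def metathesis(form):
    """*vj > *jv (applied unconditionally; a simplification)."""
    return form.replace("vj", "jv")

def assimilation(form):
    """*v > *j after *j when the front vowel *ē follows."""
    return re.sub(r"jv(?=ē)", "jj", form)

def derive(form, rules):
    """Apply the rules one after another, in the order given."""
    for rule in rules:
        form = rule(form)
    return form

# Metathesis first: it feeds the assimilation, as in tjåejjie "stomach".
feeding_order = derive("ćoavjē", [metathesis, assimilation])

# Assimilation first: it finds no *jv to act on, so /jv/ would survive.
reverse_order = derive("ćoavjē", [assimilation, metathesis])

# Before a non-front vowel the cluster is untouched, as in gåajvodh "to dig".
unaffected = derive("koajvō", [metathesis, assimilation])

print(feeding_order, reverse_order, unaffected)
```

Only the first ordering yields the *-jj- needed for SS /jj/ in “stomach”; with the reverse ordering the word would be expected to end up with /jv/ instead.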

The Ume Sami reflexes seem to support this last assumption. Although not many of the involved words have been recorded from here, /jv/ is found in those lexemes that have SS /jv/ ~ /jj/: dàivài “often” and tjåìvee “stomach” — while /vj/ is found in those that have SS /vj/: jauja “grey reindeer”, vyöyjee “wedge-shaped patch”. There is also one word with a somewhat baffling three-glide reflex: guyvjas “grey reindeer” (with unetymological /g-/ to boot). [3]

How should this distinction between S+U-metathesizing and S+U-unmetathesizing *-vj- be accounted for? Could this be etymological somehow? An interesting fact is that *vōvjē “wedge” is one of the Samic words showing lenition of original coda *k before sonorants (as shown by the Finnic cognates: e.g. modern Finnish vaaja, Karelian voakie, Livonian vaigā < PF *vakja). [4] So, perhaps this change occurred only after the metathesis of inherited *vj to *jv in Southern and Ume Sami? A late date for the change has already been suspected:

This sound change cannot be reliably dated, but it may well have taken place during a relatively late phase of Proto-Saami.

(Aikio 2006: 3.11 §) [5]

With this interpretation, a “maximally hereditary” chronology would be:

  1. Lenition *kj > *ɣj in Finnish.
  2. Samic *tāvjā “frequent” is loaned from Finnish *taɣja.
  3. Metathesis *vj > *jv in South & Ume.
  4. Lenition *kj > *vj all across Samic.
  5. Samic *jēvjë “white reindeer” is loaned from Germanic.
  6. Assimilation *v > j / j_ie in South. — Metathesis *vj > jv in Pite, Lule & Ter. — Raising *eu > *iu in Germanic.
  7. Samic *vājvē “pain” is loaned from Finnish vaiva.
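The chronology above can also be run as a small simulation. This is only a sketch of the “maximally hereditary” scenario, assuming that each word is a plain string that enters the Southern Sami lineage at some step and then undergoes every later sound change. The pre-lenition shape *vōkjē for “wedge” is a hypothetical placeholder, not a published reconstruction, and the names `SOUND_CHANGES`, `WORDS` and `south_sami_outcome` are made up for illustration.

```python
import re

# (step, description, rewrite) for the changes that affect the Southern Sami
# lineage; step numbers follow the list above.
SOUND_CHANGES = [
    (3, "metathesis *vj > *jv", lambda f: f.replace("vj", "jv")),
    (4, "lenition *kj > *vj", lambda f: f.replace("kj", "vj")),
    (6, "assimilation *v > *j / j_ē", lambda f: re.sub(r"jv(?=ē)", "jj", f)),
]

# gloss -> (step at which the word is already present, its form at that point)
# Step 0 stands for inherited vocabulary. *vōkjē is a hypothetical
# pre-lenition shape, used only to illustrate step 4.
WORDS = {
    "stomach": (0, "ćoavjē"),
    "head": (0, "oajvē"),
    "wedge": (0, "vōkjē"),
    "often": (2, "tāvjā"),
    "white reindeer": (5, "jēvjë"),
    "pain": (7, "vājvē"),
}

def south_sami_outcome(entry_step, form):
    """Run a form through every change dated later than its entry step."""
    for step, _description, rewrite in SOUND_CHANGES:
        if step > entry_step:
            form = rewrite(form)
    return form

outcomes = {gloss: south_sami_outcome(*word) for gloss, word in WORDS.items()}
for gloss, form in outcomes.items():
    print(f"{gloss}: *{form}")
```

The resulting clusters (/jj/ in “stomach” and “head”, /vj/ in “wedge” and “white reindeer”, /jv/ in “often” and “pain”) line up with the SS reflexes cited in this post. The sketch says nothing about the vowels, and it only shows that the ordering is internally consistent, not that it is historically plausible.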

…But is it a good idea to attempt maximizing the degree to which various Samic words would have been inherited from a common ancestor? I think it is important to keep in mind that fresh loanwords readily diffuse across dialect continua.

As for the particular downsides of the above scenario, at minimum I am uncomfortable assuming that the specifically Finnish change *kj > *ɣj occurred earlier than the supposedly Proto-Germanic change *eu > *iu / _j. [6] OK, it’d be possible to go on making some cleanup assumptions; e.g. that in the numerous newer Germanic loans in Finnish where *ɣj can be reconstructed, this was substituted for original *kj; or perhaps, that the /k/ ~ /g/ found in the other Finnic languages would be a reversal from *ɣ; but this would all be for no other reason than ensuring a Proto-Samic ancestry for SS daajvaj, US dàivài. We could instead assume that S+U acquired these words from the direction of P+L, and show /jv/ for this reason.

This should also call into question whether my step 3 above existed at all. *sāvjë “gnome” (elsewhere in Samic also with meanings like “underground water”, “lake with an underworld entrance”, “isolated lake”) seems like a potential cultural loan from the P+L direction at least. It is of Germanic ultimate origin, but seems to have acquired its mythical flavor only on the Sami side: the PGmc root is simply *saiwiz “lake”.

Note moreover that this loan etymology actually predicts PS *-jv-, not *-vj-! And yet there is no evidence for the inverse metathesis *-jv- > *-vj- to have regularly occurred in any Samic variety. So are we forced to conclude further that this word was originally adopted specifically in the Pite/Lule area, and hypercorrectly metathesized to *-vj- when loaned eastward from these varieties? The Southern /jv/ could similarly also turn out to be original after all.

This leaves just the question of *ćoavjē “stomach”. Relationship to Samoyedic *t¹äjwə “stomach” has been proposed. The initial consonant, vowel frontness, and glide cluster order all fail to match, though, so I suspect this is only an accidental resemblance. I could just as well propose that the Samic word is a metathesis from something like earlier *voaćjē, and therefore related to Finnic *vacca “stomach”? (Ha ha.) With the case for inheritance being in this shape, I don’t think it would be too much of a problem to assume that here, too, the S+U forms have been loaned from the direction of P+L. — But still early enough to have participated in cluster smoothing in SS, apparently.


An additional topic to ponder at this point would be the motivation of the metathesis *-vj- > *-jv-, which altogether appears to be attested in at least two widely separated parts of the Samic dialect continuum. Pite and Lule Sami are spoken in northern Sweden and adjacent areas of Norway (also Finland if we count Torne Sami), Ter Sami at the eastern end of the Kola peninsula. It seems unlikely that these groups have been in any direct contact with each other since Proto-Samic times. It also seems unlikely that this incredibly specific metathesis was purely coincidentally innovated in both. One possibility might be some kind of a phonological precondition for this change having existed already in Proto-Samic, which in only two areas led to the change running to completion?

A better solution though might be a common external source. This exact same metathesis happens to be known also from the Finnic languages! Late Proto-Finnic allows no *-vj- (or *-Vuj-/*-Vüj-: we are better off reconstructing diphthongs rather than coda glides at this date), and although no words with PU *-wj- have been retained in Finnic, a number of loanwords allow reconstructing a metathesis here. E.g. PGmc *flauja- → Finnish laiva “ship”. [7] Metatheses of some other similar clusters including older *-wr- (PS *jāvrē ~ LPF *järvi “lake”) are also found, which suggests that this type of change originated in Finnic, and that in the case of *-vj- > *-jv- it might have been passed on to Samic.

Still, why just these specific varieties? The Lule Sami probably had numerous connections with Finnic traders and settlers in the Torne Valley and adjacent areas since a much older period than the Finnmark/Inari/Skolt/Kildin Sami living further inland, that much is clear. Yet should we expect this shift to have therefore also been present in the extinct “southeastern” Sami varieties such as the marginally attested Kemi Sami?

Particularly difficult to understand is Ter Sami. I do not think we even know at present whether the Kola Sami languages developed entirely in situ, or if they may have spread to Kola from e.g. the southern reaches of the White Sea, some of their characteristic features already in tow? The presence of this sound change might require, at minimum, that Ter descend from a dialect that was originally spoken further south than the corresponding ancestral dialect of Kildin…

[1] One wonders how and why we may claim that it no longer does; or whether we are to conclude that “Northern Sami” is an areal entity rather than a genealogical one.
[2] I wonder if these last two words have some relation to each other. The semantic closeness is obvious, and the consonant skeletons are quite similar as well. The proposed etymology for *jēvjë is loaning from (pre-?)Proto-Germanic *heuja- “hue”, and the Germanic *h- moreover comes from PIE *ḱ-. Wiktionary mentions here e.g. Lithuanian šývas “white”. Could any of the Satemic cognates have plausibly been loaned to yield pre-Samic *ćawjə or *ćowjə “grey”?
[3] Or could this indicate a substitution *ḱ- > *k-, from some non-Satem variety? Perhaps not, since this would be chronologically problematic and there are other known examples of irregular *ć- > *k- in some varieties of Sami.
[4] Also by the word’s etymology as stemming from Baltic: cf. Lithuanian vagis, Latvian vadzis. For more details cf. Itkonen, Terho (1982): Laaja, lavea, lakea ja laakea. In: Virittäjä 86.
[5] Aikio, Ante (2006): On Germanic-Saami contacts and Saami prehistory. In: SUSA 91.
[6] I actually suspect this was “only” Northwest Germanic, given how Gothic shifts *e to *i always anyway. More details to come on this point later though. At any rate this would still not be a huge chronological relief.
[7] For further details cf. Koivulehto, Jorma (1970): Suomen laiva-sanasta. In: Virittäjä 74.

Posted in Reconstruction