Phonotactics vs. protolanguages

Phonotactic analysis is probably one of the most straightforward tools for statistical etymology. There are others too, but this is an analysis method that will easily bring up a wealth of data that has no real synchronic motivation (arbitrariness of the sign, once again) yet can be assumed to reflect all sorts of historical processes of language development, though usually in more or less fossilized form, perhaps even quite deeply so.

However, when the object of the analysis is a reconstructed protolanguage, another option also becomes available: to take significant quirks as instead suggesting points on which the reconstruction itself could be improved. A reconstruction is not primary data! It is allowed to make argued-for adjustments in just what the reconstruction is in the first place. (Alas, not realizing this is a somewhat common failure mode in studies mixing synchronic analysis methods with reconstructed data.)

For an example of this approach in action, here is a sneak peek at one dataset I am massaging:

OU stats 1

This table shows the co-occurrence of initial consonants and following vowels in the common Ob-Ugric lexicon, as reconstructed by Honti (1982). Since this is for the sake of an example, at this point only some small adjustments in the reconstruction have been added, nothing major. The various non-integer values are due to me splitting most roots whose reconstruction is uncertain: e.g. the root listed as *keej-/*kööj- ‘to lek’ has been tabulated as 0.5 *kee-, 0.5 *köö-. An exception to this, though, is the correspondence type marked by Honti as “uu/ïï”, which actually outnumbers several allegedly regular vowel correspondences, and seems to deserve a line of its own.

“B”, “BB” and “FF” moreover indicate correspondences that are sufficiently irregular that Honti has only dared to report if the data points towards a back or front vowel, and a long or short vowel.
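The fractional tabulation described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the actual script behind the table: the function name `tabulate` and the input format (each root as a list of alternative onset + vowel pairs) are assumptions made for the example.

```python
from collections import defaultdict

def tabulate(roots):
    """Count onset+vowel cells; a root with n alternative
    reconstructions contributes 1/n to each alternative."""
    table = defaultdict(float)
    for alternatives in roots:
        weight = 1.0 / len(alternatives)
        for onset, vowel in alternatives:
            table[(onset, vowel)] += weight
    return dict(table)

# A handful of roots cited in the text (hypothetical input format).
roots = [
    [("k", "ee"), ("k", "öö")],  # *keej-/*kööj- 'to lek': split 0.5 / 0.5
    [("ɬ", "ää")],               # *ɬääpət '7'
    [("ɬ", "ää")],               # *ɬäärəɣ 'ruffe'
    [("ɬ", "ää")],               # *ɬäärɣət 'hard'
]
table = tabulate(roots)
```

Here `table[("k", "ee")]` and `table[("k", "öö")]` both come out as 0.5, while the three certain *ɬää- roots add up to 3.0.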

So the question is: might we be able to determine if there is anything odd going on here? For just one example, while roots with zero onset are quite abundant, there seems to be an absence of roots beginning with *o-. But then again, random holes occur elsewhere in the table as well. So is this a sign of something being wrong with the reconstruction? a reflection of some earlier soundlaw in the development of Ob-Ugric? or perhaps, of nothing at all? Hard to say using only qualitative tools.

Forming some simple quantitative predictions from this type of data is however not hard. For a first approximation, say we assumed a fully random distribution of roots, with no interdependences in the occurrence of consonants vs. vowels. In this situation, the expected number of roots beginning with a given *CV- sequence could be calculated from just the total vowel and consonant frequencies. For example *-ää- occurs in 44/724 ≈ 6.1% of the roots; *ɬ- occurs in 53/724 ≈ 7.3% of the roots; their predicted co-occurrence is thus 0.061·0.073 ≈ 0.44% of the roots, i.e. the expectation value of roots beginning with *ɬää- is about 3.2.

Algebraically, the formula for this expectation value comes out as C·V/A, where C is the attested count of the onset, V the attested count of the vowel, and A the number of roots altogether.
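As a quick check of the arithmetic, the C·V/A formula can be written out as a one-line function. The function and parameter names are made up for this sketch; the three numbers are the *ɬ-, *-ää- and total counts quoted above.

```python
def expected_count(onset_total, vowel_total, all_roots):
    """Expected number of roots with a given onset + vowel,
    assuming the two occur independently: C * V / A."""
    return onset_total * vowel_total / all_roots

# *ɬ- occurs in 53 roots, *-ää- in 44, out of 724 altogether.
e = expected_count(53, 44, 724)
print(round(e, 1))  # 3.2
```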

The actual number of attested roots beginning with *ɬää- indeed happens to be 3 (*ɬääpət ‘7’, *ɬäärəɣ ‘ruffe’, *ɬäärɣət ‘hard’). So in this case the prediction is spot on! Many of the other CV combinations seem to work this well too, “off” by about 1 at most. But larger deviations can also be found. Here is the full table of differences between the attested and expectation values, with some color-coding applied:

OU stats 2

As an initial observation, note the gradual accumulation of random holes and peaks: a lesser number of roots are off by about 2, even fewer off by about 3, etc. Also unsurprisingly, bigger deviations are mainly found towards the upper left, where the data is denser.

At this point we could continue quantitative analysis. Making various starting assumptions about expected variance in the vocabulary and then doing a whole bunch of math would probably be able to tell us if the general patterning of the data shows statistically significant deviations or not. But… this seems like a bit too much work. For one, parts of the table would end up having to be recalculated if we were to adjust the underlying reconstruction even just a bit (e.g. by splitting a given proto-vowel in two). And for two, it is not at all obvious what should be our default hypothesis! It is already known that languages tend to prefer some phoneme combinations over others. And yet, AFAIK, a universal typology of this has yet to be developed even qualitatively. Applying detailed rigorous methodology while relying on guesstimated background assumptions would be a waste of effort.

Instead, I think at this point a qualitative human intervention can already tell us how likely it is that there is anything interesting going on here at all. Rather than aiming for assessing every single entry, let’s check out just the lowest-hanging fruit. The 5 most aberrant *CV- sequences in the data are:

  1. *wuu: +9.0
  2. *kuu: +7.4
  3. *ää: +6.9
  4. *mee: +5.2
  5. *kää: -5.0
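A ranking like the one above could be produced along the following lines. This is a minimal sketch: the function names are invented, and the toy counts are made up purely to show the mechanics (they are not Honti's figures). Cells missing from the table, such as ("m", "ää") here, count as attested zero, i.e. as holes.

```python
def deviations(table):
    """Attested minus expected (C * V / A) for every onset x vowel cell."""
    total = sum(table.values())
    onset_totals, vowel_totals = {}, {}
    for (onset, vowel), n in table.items():
        onset_totals[onset] = onset_totals.get(onset, 0) + n
        vowel_totals[vowel] = vowel_totals.get(vowel, 0) + n
    result = {}
    for onset, c in onset_totals.items():
        for vowel, v in vowel_totals.items():
            expected = c * v / total
            result[(onset, vowel)] = table.get((onset, vowel), 0) - expected
    return result

def most_aberrant(table, n=5):
    """The n cells with the largest absolute deviation."""
    dev = deviations(table)
    return sorted(dev.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]

# Toy counts, invented for illustration only.
toy = {("w", "uu"): 6, ("w", "ää"): 2,
       ("k", "uu"): 2, ("k", "ää"): 6,
       ("m", "uu"): 2}
top = most_aberrant(toy, 2)
```

By construction the deviations sum to zero over the whole table, so every peak has to be balanced by holes somewhere else.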

Since my initial point is to demonstrate that calculating phoneme co-occurrence rates among a proto-language’s lexicon can reveal evidence for adjusting the reconstruction, then surely this sort of evidence should be found in this end of the data, if at all.

And indeed, it looks like at least the first case is not an accident. In part it probably reflects the fact that the contrast between *uu- and *wuu- is not very clearly indicated in the data at all. Most Ob-Ugric varieties have lost *w before rounded vowels; and some others like Pelymka Mansi and Kazym Khanty have by contrast introduced an epenthetic *w before some rounded vowels. In other words, we may already suspect that having as many as nine roots “too many” indicates that some of Honti’s *wuu- roots here should be actually reconstructed with plain *uu- instead.

A look at Southern Mansi suggests a few good candidates. These are the words where Honti assumes shortening *uu > *u in Mansi (although this is a change he does not really present any conditioning for):

  • #668 *wuuj- >> SMs oj- ‘to swim’ (~ Pelymka wuj-, Kazym wooś-)
  • #682 *wuulɜ >> SMs olā ‘pole’ (~ Pelymka wula, Kazym wooɭ)
  • #689 *wuunč- >> SMs onš- ‘to run over’ (~ Pelymka wunš-, Kazym wuš-)
  • #708 *wuur >> SMs or ‘edge’ (~ Kazym wur)
  • but: #706 *wuur >> SMs wor ‘possibility, way’ (~ Kazym wur)

It looks like Southern Mansi may actually have maintained a contrast between *w and zero in this environment. And, better yet: Honti also fails to list any examples beginning with (zero onset plus) *uu that would have any potentially incriminating reflexes in Pelymka, Kazym, or other similar dialects. So there seems to be no obstacle to adjusting the reconstructions to *uuj- ‘to swim’, *uulɜ ‘pole’, *uunč- ‘to run over’, *uur ‘edge’. In the case of ‘to swim’ we can even verify this with external evidence! Consider Permic *uj- ‘to swim’. Normally Permic should retain evidence of *w even before rounded vowels (as in Finnish uusi, Hungarian új ~ Komi выль /vɨlʲ/ ‘new’), but no such thing appears here.

Recognizing w-epenthesis also allows cleaning up #701 *wuupɜ ‘older sister’, where *w seems to have again been posited only on the basis of Pelymka wuup. The Khanty reflexes like Tremjugan oopïï, Kazym opi, Obdorsk apii do not support positing *w- at all. Neither does the Proto-Samoyedic cognate *apå. By external evidence, #688 *wuunč ‘nelma’ (a type of salmon) similarly seems to be a case of secondary *w: contrast Proto-Samoyedic *ånčɜ, Komi удж /udž/.

Moreover, the above type of scenario is not the only possible kind of explanation for why a particular sound sequence might be non-randomly overrepresented. A different issue seems to concern the following two words:

  • #659 *wuuč ‘town’
  • #660 *wuučəm ‘weir’

Wider Uralic etymological references generally consider these words to be based on one and the same root. Cognates such as Northern Sami oahci ‘barrier, obstacle, reef’ or Tundra Nenets ва” /waːʔ/ ‘fence’ seem to point to the original basic meaning having been simply ‘fence, obstacle’, from which the two attested meanings are easily derivable. Perhaps also #657: *wuuč- ‘to fish’ is a part of the same bundle. Honti indeed even includes small footnotes in the lexicon commenting on the possible relationship of these three words. It’s not clear to me why he regardless lists them separately.

Altogether at least eight of the roots where Honti reconstructs *wuu- seem to be superfluous in some sense. A pretty good catch for such a simple statistical tool, so far.

I’ve only taken a more casual look at the other top-5 cases, but some instances of *kuu- also might be illusory. More briefly:

  • #229 *kuuďmɜ ‘ashes’: according to a recent proposal from Ante Aikio, this would be a derivative of the root listed by Honti as #227 *kuuď-/*kïïď- ‘to disappear’.
  • #261 *kuulpɜ ‘net’ is generally considered an old derivative of #245 *kuul ‘fish’.

Some less directly apparent phenomena may also have shaped the data. For one, I have here only charted out the co-occurrences of initial consonants + initial vowels. Perhaps a look at medial consonants, or the few stem vowels that are found in the data, would turn up other results. In theory it is even possible that some initial *CV- effects are the secondary product of sound changes involving medials instead. Suppose initial X had some interaction with medial Z, and this then had some interaction with vowel Y; this would already suffice to generate a correlation in some direction between X and Y. Hence, with this mode of analysis, it seems efficient to attack the data from multiple directions. Take a couple of snapshots from different angles, look thru the biggest problems that come up, recalculate the results after any adjustments… and see if this then brings any new issues to light.
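The “recalculate after any adjustments” step is worth spelling out, since moving roots from one cell to another changes the onset totals and thereby the expectation values of every cell in the affected rows. A small sketch, with invented function names and toy counts that are not taken from the actual table (zero onset is written as the empty string):

```python
def expected(table, onset, vowel):
    """C * V / A, computed from the current state of the table."""
    total = sum(table.values())
    c = sum(n for (o, _), n in table.items() if o == onset)
    v = sum(n for (_, w), n in table.items() if w == vowel)
    return c * v / total

def move_roots(table, old, new, n):
    """Return a copy of the table with n roots re-reconstructed from
    cell `old` to cell `new` (e.g. *wuu- to plain *uu-)."""
    adjusted = dict(table)
    adjusted[old] = adjusted.get(old, 0) - n
    adjusted[new] = adjusted.get(new, 0) + n
    return adjusted

# Toy counts, invented for illustration only.
toy = {("w", "uu"): 10, ("w", "aa"): 10, ("", "uu"): 2, ("", "aa"): 18}
before = toy[("w", "uu")] - expected(toy, "w", "uu")          # surplus of 4.0
adjusted = move_roots(toy, ("w", "uu"), ("", "uu"), 4)
after = adjusted[("w", "uu")] - expected(adjusted, "w", "uu")  # surplus of about 1.2
```

In this toy case moving four roots removes most of the *wuu- surplus and most of the matching *uu- deficit at the same time, which is the kind of knock-on effect that makes a second pass over the whole table worthwhile.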

Primary vs. secondary *ë

I claimed in my post “Two Lemmata” that the reconstruction of Proto-Uralic *ë rests on quite firm ground by now. Regardless, it is still not too rare to see studies which fail to recognize the idea. [1] Apparently the existence of this proto-vowel cannot yet be considered to have reached the status of general consensus. Why is this?

Assuming that the relevant literature has simply gone unread might be a bit too uncharitable. I believe a better reason for why doubts persist would be that no single unified source discussing the reconstruction of this vowel is available; the information needs to be pieced together from disparate sources. I hope to have previously provided a brief overview, though, and in this post I will explore some additional complications.

Probably one obstacle has been that the evidence for *ë is not trivial. For all other PU vowels, the evidence of Finnic, which has been presumed highly archaic, can generally be taken as direct: PF *a < PU *a, PF *o < PU *o, PF *ü < PU *ü, etc. (with only minor conditional shunts). The PF vowels also generally remain intact in the descendants. And only in Finnic does the contrast *a/*ë seem to be irrecoverably lost. Hence, one necessary precondition for accepting PU *ë is to accept that the Finnic vowel system does too contain innovations, even major ones.

(You’d think there should be no need to explicitly spell out something this basic, but alas, long-outdated ideas about “key languages” have persisted for long in Uralic studies. Better safe than sorry…)

The direct evidence in East Uralic

The best evidence for the reconstruction of *ë comes instead from the quite distinct reflexes in the easternmost branches: Mansi (*ë > *ëë), Khanty (*ë > *ïï) and Samoyedic (*ë > *ë, *ï). Hungarian *ï, though it has in the modern language merged with the front vowels i ~ í, is also quite distinct in its refusal to adhere to vowel harmony. However, in general the vowel systems of these groups have been subject to much innovation, and it takes care to wring out evidence from here.

The single most important observation, I believe, is to look beyond individual details and to note that among all these four branches (i.e. across the East Uralic group in its entirety) the general categories of non-open unrounded back vowels appear cognate to each other. Thus we can find correspondence sets such as the following:

  • H ín (: ina-) ~ Ms *tëën ~ Kh *ɬaan ~ Smy *čën ‘vein, sinew’
  • H nyíl (: nyila-) ~ Ms *ńëël ~ Kh *ńaaL ~ Smy *ńëj ‘arrow’
  • H nyír (: nyira-) ~ Ms *ńëërəɣ ~ Kh *ńaarəɣ ~ Smy *ńër ‘cartilage’
  • Ms *ëët ~ Kh *aapət/ɔɔpət ~ Smy *ëptə ‘hair’
  • H al- ~ Ms *jal- ~ Kh *ïïL ~ Smy *ïlə ‘under’
  • H máj ~ Ms *mëëjt ~ Kh *muukəL ~ Smy *mïtə ‘liver’
  • Ms *tëët ~ Kh *ɬïïkəL ~ Smy *tïtə ‘Swiss pine (Pinus cembra)’
  • Kh *ïïkət- ~ Smy *ïtå- ‘to put up (e.g. a net)’

The alignment is not perfect, but it’s far better than we’d expect to happen randomly. It’d take some odd coincidences to end up with this situation from an original system containing no “ë-type” vowels. [2] I suppose there is the theoretical possibility of proposing *ë to have been an East Uralic innovation, or proposing a set of similar but not identical parallel innovations in the four groups, but I have not seen this done convincingly. [3]

The individual details of course still need examination as well. A 1st-degree correction factor is to note the mainly stem-vowel conditioned split developments in Hungarian (*ë-ə > i ~ í vs. *ë-a > a ~ á) and Khanty (*ë-ə > *aa vs. *ë-a > *ïï). There is very little direct evidence for the original stem vowels in any of the Ugric languages, and the Samoyedic evidence has its limitations as well, but their western relatives help here: cf. e.g. Finnish suoni, nuoli, hapsi vs. ala-, maksa, ahtaa. You may also notice that the H and Kh splits run in largely opposite directions, and indeed I do not think any examples are known where H í or i would correspond to Kh *ïï. There are moreover also some apparent exception cases with *ë-a > *aa in Khanty, though, so the exact analysis of this split may require further fine-tuning.
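The two stem-vowel conditioned splits can be summed up as a small lookup table. This is just a restatement of the correspondences given above in code form; the table layout and the function name are made up for the sketch, and the exceptions mentioned in the text are not modelled.

```python
# Regular reflexes of primary *ë by original stem vowel, as stated above.
REFLEXES = {
    ("Hungarian", "ə"): "i ~ í",   # e.g. ín 'vein', cf. Finnish suoni
    ("Hungarian", "a"): "a ~ á",   # e.g. al- 'under', cf. Finnish ala-
    ("Khanty", "ə"): "*aa",        # e.g. *ɬaan 'vein'
    ("Khanty", "a"): "*ïï",        # e.g. *ïïL 'under'
}

def regular_reflex(branch, stem_vowel):
    """Expected reflex of *ë in `branch`, given the original stem vowel."""
    return REFLEXES[(branch, stem_vowel)]

for stem in ("ə", "a"):
    print(stem, regular_reflex("Hungarian", stem), regular_reflex("Khanty", stem))
```

Laid out like this, the opposite direction of the two splits is easy to see: the stem type that gives a close vowel in Hungarian gives an open one in Khanty, and vice versa.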

Secondary *ë in Hungarian and Mansi

As 2nd-degree corrections, it also seems to be the case that East Uralic *ë-type vowels can regardless in some cases represent conditional developments from different PU vowels altogether.

One prominent source of secondary *ë is cheshirization in Mansi. In what seems likely to be a late change, expected Proto-Mansi *oo followed by a velar consonant develops to *ëë followed by a labialized velar. Typical examples include *čaŋa- > *čooŋk- > *čëëŋkʷ- ‘to hit’; *ńoxə-lə- > *ńooɣl- > *ńëëwl- ‘to follow’. (Contrast Samoyedic *čåŋå-, *ńo-.) This is a fairly self-evident change on account of being one of the only regular sources of labiovelars in Mansi (together with similar effects triggered by other labial vowels). It has previously even inspired claims that perhaps all cases of *ëë in Mansi are similarly secondary, as in Erkki Itkonen’s mid-1900s model of Finno-Ugric vocalism. [4] However, the other cases resist explanation by similarly simple conditioning. “Redistributionary” splits, which do not lead to the creation of any new phonemes or even allophones, do happen! Being able to condition the appearance of a sound in one environment is not sufficient evidence for concluding that its appearance in other positions would therefore have to be conditioned by something as well.

And indeed, we can even find contrasts (near-minimal pairs) between primary and secondary *ëë in Mansi. Consider e.g. *këŋkə- ‘to climb’ > Ms *këëŋk-; but *aŋa- ‘to open’ > Ms *ëëŋkʷ- ‘to undress’. As the shift *oo > *ëë / _K has normally left a trace in the form of the labialization of the following velar consonant, roots like the first could only be accommodated into the system by abandoning regularity and switching to a much weaker model running on “sporadic” sound changes.
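The argument can be made concrete by writing the Mansi change as a rewrite rule. The following is a rough sketch only: the function names are invented, the rule covers just the environments seen in the two examples above (*k and *ɣ, optionally preceded by *ŋ), and treating *w as the labialized outcome of *ɣ is my reading of *ńooɣl- > *ńëëwl-.

```python
import re

# Labialized outcomes of the velars; *ɣ > *w is inferred from the example.
LABIALIZED = {"k": "kʷ", "ɣ": "w"}

def cheshirize(form):
    """*oo > *ëë before a velar, which is labialized in turn."""
    return re.sub(
        r"oo(ŋ?)([kɣ])",
        lambda m: "ëë" + m.group(1) + LABIALIZED[m.group(2)],
        form,
    )

def may_be_secondary(form):
    """True if *ëë is followed by a labialized velar, i.e. if the form
    could be an output of the rule above."""
    return re.search(r"ëëŋ?(kʷ|w)", form) is not None

print(cheshirize("čooŋk-"))          # čëëŋkʷ-
print(cheshirize("ńooɣl-"))          # ńëëwl-
print(may_be_secondary("ëëŋkʷ-"))    # True
print(may_be_secondary("këëŋk-"))    # False
```

Since the rule can only ever output *ëë plus a labialized velar, a form like *këëŋk- ‘to climb’ cannot be one of its products, which is the point of the near-minimal pair.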

Another sound law responsible for secondary *ë-type vowels also seems to be identifiable. This is a type of “illabiality assimilation”: *o > *ë / _jC.

This development has long been recognized for Hungarian. E.g.:

  • *kojə-ma > *kojmV > *këjmV > hím ‘male’ (cf. Skolt Sami kuõjj ‘husband’ < PS *kōjë)
  • *pojə-ka > *pojɣV > *pëjɣV(-w) > fiú ‘boy’ (cf. Finnish poika)
  • *kojɜ-ta- > *kojðV- > *këjðV- > hízik (hízo-) ‘to become fat’ (cf. Mordvinic *kuja ‘fat’)
  • *tojə-ntV > *tojdV > *tëjdV(-w) > tidó ‘birch bark’ (cf. unsuffixed Udmurt /tuj/, Komi /toj/) [5]

The first two cases are well-known and relatively clear. I am not sure if the latter two have been previously noted, but they seem to work equally well. A fifth case might additionally be *kojə-ra > *kojrV > *këjrV > here ‘drone; testicle’ (cf. Finnish koiras ‘male’) — though it is unclear why we get here a mid vowel e, instead of the expected i ~ í. [6] It’s also interesting how hím (hime-) and here follow vowel harmony; yet the shift *k- > h- still indicates them descending from back-vocalic originals.

It is also fairly clear that the change only occurred in closed syllables: this is shown by e.g. *kojɜ > háj ‘fat’, *pojə > faj ‘species’ (though the semantic development here seems questionable), *śojə > zaj ‘noise’.
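The rule *o > *ë / _jC and its closed-syllable restriction can be tested mechanically against the intermediate forms cited above. A minimal sketch, assuming that anything not in the vowel list below (which includes the cover symbols V, ə and ɜ) counts as a consonant; the function name is made up.

```python
import re

# Vowel symbols used in the forms above; V, ə and ɜ are cover symbols.
VOWELS = "aeiouëïäöüəɜV"

# *o followed by *j plus a consonant, i.e. *oj in a closed syllable.
RULE = re.compile(rf"o(?=j[^{VOWELS}])")

def illabialize(form):
    """Apply *o > *ë / _jC."""
    return RULE.sub("ë", form)

for form in ["kojmV", "pojɣV", "kojðV-", "tojdV", "kojɜ", "pojə", "śojə"]:
    print(form, ">", illabialize(form))
```

The first four forms come out with *ë (këjmV, pëjɣV, këjðV-, tëjdV), while *kojɜ, *pojə and *śojə are left untouched, since there *j is followed by a vowel.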

Interestingly there seems to be evidence of this change having extended to Mansi as well. At least three promising and two potential examples can be found:

  • *kojə-ra > *kojrV > *këjrV > *këër ‘male animal’ (cf. Fi. koiras)
  • *kojwV-lV > *kojlV > *këjlV > *këëĺ ‘birch’ (cf. Fi. koivu)
  • *soja-tV > *sojtV > *sëjtV > *tëëjt ‘sleeve’ (cf. Skolt Sami suäjj < PS *soajē; unsuffixed *soja > ujj in Hungarian)
  • ? *poskə > *poɣɬV > *pojɬV > *pëjɬV > *pëëjt ‘cheek’ (cf. Fi. poski)
  • ? *ńojta > *ńëjtV > *ńëëjt > *ńääjt ‘shaman’ (cf. Fi. noita)

The 4th has a kind of a chicken-and-egg problem: after primary *ë there is some evidence for a shift *ɣ > *j (e.g. *mëksa > *mëëjt ‘liver’; *wëlka- > *wëëɣl- ~ *wajt- ‘to rise’) [7], but we obviously cannot use both *ëë to condition the *j and *j to condition the *ëë. A possible ad hoc solution would be to reconstruct something like #pojsəkə, but let’s not.

The 5th requires a shift from *ëë to *ää, seemingly due to the influence of two flanking palatal/ized consonants. It is not clear though if this should be dated to the Proto-Mansi level, or perhaps later. Northern Mansi /ńaajt/ and Southern Mansi /näjt/ could actually regularly reflect PMs *ńëëjt as well: the former thru the regular lowering *ëë > *aa, the latter thru the regular fronting *ëë > *ee adjacent to palatalized consonants + vowel shortening to /ä/. For these changes a perfect parallel is PMs *ńëëraa > *ńeerää > SMs /ńärää/ ‘legwear’; [8] a word not of Uralic inheritance, but here the regular back vowel is still found in Eastern Mansi /ńëërə/, Northern Mansi /ńaara/. It is only the Eastern and Western reflexes of ‘shaman’ that point to older *ää specifically.

It’s moreover possible that the 2nd case actually indicates instead a fairly similar change: *o > *ë / _ĺ. In this light two further interesting words are PU *śod₁ka > *soĺɣV > ? *sëĺɣV > Ms *sëëĺ ‘goldeneye’ (cf. Finnish sotka); and Ms *këëĺt- ‘to peel (e.g. hemp)’, which has been compared to Mari *kŭðaša-, Komi /kuĺ-/, Udmurt /kɨĺ-/ ‘to undress’, and behind which a PU root *kod₂V- could be reconstructed. [9]

There is no clear evidence on how *-od₂- is reflected in Hungarian; this has not been a frequent sound sequence. However, one old lexical comparison (that the UEW rejects) might be rehabilitable if we assumed that also this change occurred in Hungarian: *śod₂a ‘war, fight’ (cf. Finnish sota ‘war’; Mari *šuðala- ‘to scold’) > *śod₂a-nta- > *soĺdV- > *sëĺdV- > szid ‘to scold’? A cluster simplification *ĺd (? > *ɟd) > *d would also have to be assumed though.

However, even though these changes are highly similar, there is a strange complication that seems to preclude an analysis as a common Hungarian-Mansi innovation. In most words where Hungarian points to this kind of a secondary *ë, the Mansi development differs — we see a loss of *-j- instead:

  • *kojə-ma >> *kum ‘man’
  • *kojə-ta- >> *kaat- ‘to become fat’
  • *pojə-ka >> *piw ~ NMs /piɣ/ ‘boy’
  • *tojə-ntV >> NMs /toont/ ‘birch bark’

At least the 2nd and 3rd of these are clearly irregular: *-jt- is a perfectly valid consonant cluster in Mansi (cf. ‘sleeve’ and ‘shaman’ above), and there are no parallels for a vowel development from *o (or for that matter, any other back vowel) to Ms *i. The 1st brings to mind the developments *kojə > *kuj ‘male’, *śojə > *suj ‘sound’. Was ‘man’ perhaps derived in Mansi from a vowel-stem variant *kojəma > *kujəmV > *kujm?

Perhaps it is relevant that the irregular loss of *-j- in these words extends also to Khanty: *kaatLə- ‘to become fat’, *pak ‘son’, *tontəɣ ‘birch bark’. A fourth example of this is also known, the word for ‘louse’: Ms *tääkəm, Kh *teeɣtəm (also Hungarian tetű); contrasting with Finnic *täi, Udmurt /tej/, Komi /toj/. [10] We could perhaps suppose a loss of *j before a consonant cluster to explain the last two… Though *-ktV is not really a typical Uralic noun formant, and so I also wonder if the Ugric words for ‘louse’ are not perhaps instead somehow related to the quite similar root *tikte found in Tungusic.

In Mansi, further examples of apparent secondary *ë can still be found as well. The residue includes e.g. Fi. os-ta- ‘to buy’ ~ Ms *wëëtaa ‘ware’; Fi. otta- ‘to take’ ~ Ms *wëët- ‘to pluck’. [11] Itkonen in his critique has claimed that *ëë would be even the most frequent correspondence of West Uralic *o, and this seems to still hold up pretty well even once we remove the words showing Finnic *oo (< *a/*ë via Lehtinen’s Law) from the count. It might still be possible that there has indeed been a default development *o > *ë in Mansi, only one bled by several conditional developments. Regardless: this type of secondary *ë must still be distinguished from primary *ë, which is instead normally reflected as *a in West Uralic, and is further supported by the Samoyedic evidence.

[1] For just one example, no mention of this result appears in what I believe is the newest overview of Hungarian historical phonology available: the fifty-odd page appendix in Róna-Tas, András & Berta, Árpád (2011): West Old Turkic: Turkic Loanwords in Hungarian. Wiesbaden: Harrassowitz.
[2] This can be contrasted with the Western end of the family. “Ë-type” vowels are not at all unknown here either. However, these show no relation to each other. E.g. Ter Sami has the vowels /ï/ and /ïë/, from Proto-Samic *ō and *oa < PU *a and *o. Skolt Sami has õ [ɘ] and â [ɜ] plus the long versions, under various conditions from PS *ë < PU *i, *ü, *e-ə. And the various languages of the Southern Finnic areal have õ [ɤ ~ ɨ], mostly from *e, though in some cases from *o.
[3] At least Reshetnikov & Zhivlov (2011; see Bibliography) have attempted an analysis to this effect, but they do not analyze Hungarian or Khanty, and they exclude some material previously reconstructed with original *o that turns out to be quite relevant. A recent follow-up in Zhivlov (2014) has abandoned the idea.
[4] He has presented some detailed critique against the reconstruction of *ë (“Vokaaliston kysymyksiä”, 1988, Virittäjä 92 pp. 325–329), though it seems this never led to much further discussion of the matter, and after Itkonen’s death in 1992 no one else seems to have had much interest in defending his system of vowel reconstruction.
[5] An alternate reconstruction *tejɜ- would also work for the 1st syllable vocalism, but this would predict a vowel-harmony-compliant **tidVw > ˣtüdő in Hungarian.
[6] It would be possible to hypothesize e.g. that inherited *ë had already been split to Old Hungarian *i vs. *a at this date, and that *oj first yielded not *ëj, but rather *ej, which was later assimilated to *i; and that in ‘drone’, *j was then lost early, leaving a mid vowel. The Mansi evidence seems to support an earlier shift specifically via *ë, though.
[7] There are other words as well with a more limited distribution; cf. Honti 1982: 29–30. These words mostly feature an alternation between a base form with *-ëëɣ- and an oblique stem with *-aj-. I would assume that this *j was later generalized to the nominative in the body part terms ‘liver’ and ‘cheek’, which will only rarely occur as subjects.
[8] On a slightly off-topic note, I am not sure if the Southern Mansi long open stem vowels should be taken as original. They don’t seem to contrast with the corresponding short full vowels, and indeed, they correspond to short stem vowels in the other Mansi dialects. They also regularly condition shortening of 1st syllable vowels. I suspect some sort of a prosodic effect here: e.g. ˈV₁-V₂ > V₁-ˈV₂ when V₂ was a full vowel, followed by lengthening of the newly stressed V₂, and if applicable, shortening of the newly unstressed V₁.
[9] The shift *u > /ɨ/ seems to be regular in Udmurt before coda /ĺ/. Other examples include *kad₂a- > PP *koĺ- > *kuĺ- > kɨĺ- ‘to stay’ (cf. Komi koĺ-); *kod₂ka > PP *kuĺ > kɨĺ ‘disease, evil spirit’ (cf. Komi kuĺ); *neljä > PP *ńoĺ > *ńuĺ > ńɨĺ ‘4’ (cf. Komi ńoĺ). Contrast though retention before intervocalic /ĺ/ in muĺɨ ‘berry’, tuĺɨm ‘topmost yearly growth of tree’.
[10] Mari *ti is ambiguous: this could also derive from e.g. *täkV or *tikV. Samic *tikē is though probably an unrelated loan from Germanic (or perhaps from the same pre-Indo-European source as the Germanic words).
[11] These two might suggest a dissimilation *wo- >> *wëë- at first glance, but a counterexample is *woča > Fi. ota-va ‘fish trap’ ~ Ms *wooš ‘weir; fence; city’.

Etymologically opaque Votic words

For later reference, here’s a collection of etymologically opaque (to me) Eastern Votic words harvested from my new dictionary. I will not attempt any detailed analysis yet. (Presumably some investigation into Russian, Ingrian, Estonian, maybe even Latvian & German could turn up known cognates for many of these.)

  • aimo ‘carbon monoxide’
  • alëtsë ‘mitten’
  • hilkeä ‘ugly’ — if this is not a hypercorrect cognate of Finnish ilkeä ‘evil’.
  • hulkkuag ‘to travel’
  • hülpeä ‘disobedient’
  • ikolookka ‘rainbow’ — a compound based on lookka ‘bow, curve’, but the 1st element is unclear.
  • jahsaag ‘to take off shoes’ — does not seem related to Finnish jaksaa ‘to have energy for’.
  • kaaliag ‘to lick’
  • kaputta ‘sock’
  • kineri ‘melted fat’
  • koltši ‘old-fashioned ladle’
  • kosma ‘hair’
  • lainatag ‘to swallow’ — does not seem related to Finnish lainata ‘to borrow’.
  • lautta ‘cowshed’ — does not seem related to Finnish lautta ‘raft’.
  • liblo ‘oat awn’
  • linnaasëd ‘malt’
  • lohko ‘soup’
  • lühtši : lühdže- ‘pail’
  • läntü ‘milk’
  • mauttši ‘intestine’
  • naka ‘cask spigot’
  • nakliska ‘some part in a sleigh’ (“the informant is unable to explain what exactly”)
  • nëikko ‘rockable cradle’
  • nättšelikko ‘burdock’
  • nätši ‘uncooked (of bread)’
  • nättü ‘rag’
  • ootava ‘cheap’
  • pallo ‘pigeon’
  • pelssimed ‘loom’
  • peltta ‘leftovers of threshing’
  • pihta ‘shoulder’
  • pilpa ‘dandruff’
  • pärähmä ‘fathom, armful’
  • raaka ‘twig’
  • ramitsaag ‘to limp’
  • ratiz ‘granary’
  • rehnüüz ‘entrance hall’
  • rehtilä ‘griddle’
  • ringuttaag ‘to stretch’
  • ripa ‘footwraps’
  • ripila ‘fireplace poker’
  • rooppa ‘porridge’
  • rootšiag ‘to dig, to rummage’
  • ruttaag ‘to hurry’
  • śalko ‘foal’
  • servä ‘edge’
  • sippelikko ‘ant’
  • sisava ‘nightingale’
  • sultsiag ‘to wash’
  • surmukaz ‘relative’ — probably not derived from surma ‘death’?
  • säblä ‘kitchen hook’
  • šitinka ‘bristle’
  • šlotta ‘slush’
  • taari ‘ale’
  • tahtši : tahdžë- (!) ‘chaff’
  • tauttaag ‘to take’
  • tiheh ‘mosquito’
  • turvaz : turpaa- ‘ladder’
  • tuutikko ‘washbundle’
  • türü ‘food comprising breadcrumbs mixed with milk or water’
  • tšiutarë ‘coldroom’
  • tšiutto ‘shirt’
  • tšäppeä ‘beautiful’
  • uhër ‘auger’
  • unka ‘wooden cup’
  • upa ‘bean’ = Est. uba.
  • ursi : urtë- ‘bed curtain’
  • vaattaag ‘to look’
  • valo ‘dung’
  • varo ‘hoop’
  • veelatag ‘to soak’ — compound with vete- : vee- ‘water’?
  • vokki ‘spindle’
  • väitšiäg ‘to call’
  • ördžähtäässäg ‘to wake’

A potential Turkic-Yukaghir loanword

A project I am working on, on and off, is compiling lexical parallels that have been proposed in connection with various proposed external relationships of Uralic. Occasionally this kind of work turns up nice new etymological insights.

One of the best-retained (and also one of the more specific) verbs of motion reconstructible for Proto-Uralic is *kälä- ‘to wade’: reflected in e.g. Northern Sami gállit ‘to wade’, Finnish kahlata ‘to wade’ (an old loan from Samic), and Hungarian kel ‘to rise’. (The meaning ‘to rise’ is found also in Mansi and Khanty; the latter also has ‘to step up on land’.) This has been compared with the Yukaghir verb *kel- ‘to come’. The pairing is phonetically OK, but semantically it does not seem impressive. It might be acceptable if a relationship between Uralic and Yukaghir were already established, but it offers hardly any evidence for a relationship in the first place.

Interestingly enough, the same Uralic verb has also been compared with Turkic *gel- ‘to come’ — with the exact same semantics and an equally compatible phonetic shape! (E.g. already Björn Collinder in Fenno-Ugric Vocabulary, 1955/1977, reports both comparisons.) Probably the first step here should be to analyze the Yukaghir word as a loan from Siberian Turkic, and worry about any possible Uralic relationships later.

I would predict that pitting the Uralo-Yukaghir and Ural-Altaic hypotheses against each other may turn up further cases like this where a straightforward loan etymology is available. It’s already been noted by Rédei in his “Zu den uralisch-jukagirischen Sprachkontakten” (1999, in FUF 55) that many of the Uralic-Yukaghir lexical parallels extend to some of the “Altaic” languages as well…

Statistical etymology: A Votic example

Last Friday I picked up a dictionary of the Mahu dialect of Eastern Votic (Castreanianumin toimitteita 27, 1986), based on Lauri Kettunen’s collections from about a hundred years ago. [1]

This is not a particularly huge book, with only about 150 pages of lexical data, set in a relatively large monotype font, too. It probably won’t be of much use if one wished to e.g. translate Firefox into Votic. Its usability as a tourist dictionary might be limited as well (even if we ignore the sad fact that Votic is hard moribund, with only some dozens of speakers left). But it seems like a good reference for a linguist wishing to make some contact with the language. Or: a handy unit of data for a linguist wishing to understand the lexical structure of languages.

The lexicons of natural languages are not random in their makeup. Phonemes have differing frequencies of occurrence in different positions of words; and different tendencies of combining with each other. And although one can certainly find linguists who will attempt to offer explanations in terms of elaborate synchronic phonological constraints and preferences, I find this a fundamentally flawed approach. [2] Much more often, any patterns evident in the lexicon are best understood as the fossilized results of historical processes: sound changes, loanword strata and evolving standards of sound-symbolic conventions. The study of a language’s lexicon even at a single point in time will likely turn up insights into its history.

For this type of analysis, this Votic dictionary actually seems like a rather good sample. The lexicon of any major literary language would be both overwhelming in size (possibly thousands of pages) and swamped with recent cultural loanwords (if you happen to find a word shaped approx. like /banana/ or /platinum/ in a given language, this will not tell you much about its prehistory). Neither of these problems is apparent here, and it’s possible to focus on the big picture without getting stuck on data wrangling. At the other end, a still shorter list of, say, 100 words, whether artificially truncated or recorded in passing in 1820 from some now-extinct language, would not allow for many statistically significant conclusions at all.

A simple starter example: the Finnic languages did not originally contrast voicing in obstruents (as was the case already in Proto-Uralic). This situation still remains in place in Estonian, Northern Karelian, and dialects of Finnish. Votic, however, is among the siblings that have fully embraced voicing, and contrasts voiced and voiceless versions of all obstruent consonants: /p t tš k f s š/ ≠ /b d dž g v z ž/. Suppose we were to hand a copy of this dictionary to a linguist who’s never worked with Finnic before. Will they be able to uncover this older constraint?

The answer seems likely to be “yes”. Only minor etymological analysis is required, and the dictionary itself even provides it. The lexemes in the dictionary are glossed in both Russian and Finnish, the two major contact languages of Votic. Additionally, several words identifiable as recent Russian loans are indeed so marked. This allows an initial separation of the lexicon into two mostly disjoint layers: those of Finnic vs. Russian background. (Though of course Finnish has some Russian loanwords as well, and small amounts of words whose origin is not immediately obvious can also be found.)

A look at words beginning with voiced obstruents other than /v/, as well as words beginning with /f/ shows that they, as a rule, belong in the Russian layer. This is a small set to begin with, and after this cleanup, no more than seven counterexamples remain:

  • balalaittaag ‘to gossip’
  • bëëg ‘isn’t’
  • borissag ‘to bubble’
  • bulissag ‘to bubble’
  • börö ‘ironing board’
  • däädi ‘some relative’
  • filissaag ‘to whistle’

So we have four onomatopoetic verbs, one unstressed particle, one nursery word, and one fully legit content word. This is not sufficient evidence to postulate the voicing contrast to be original in the initial position, not when evidently inherited words beginning with /p t tš k s v/ number multiple hundreds altogether. [3]
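
The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entry format, the `russian_loan` flag and the `type` labels are hypothetical (the labels follow my reading of the summary above), and the only data included are the seven residual words themselves, not the full dictionary.

```python
from collections import Counter

# Initials that should not occur in inherited Finnic vocabulary:
# voiced obstruents other than /v/, plus /f/. The digraph comes first
# so that "dž" is not mistaken for plain "d".
SUSPECT_INITIALS = ("dž", "b", "d", "g", "z", "ž", "f")

def suspect_initial(word):
    """Return the suspect initial of `word`, or None if it has none."""
    for segment in SUSPECT_INITIALS:
        if word.startswith(segment):
            return segment
    return None

def residue(entries):
    """Keep entries with a suspect initial that are not marked as Russian loans."""
    return [e for e in entries
            if suspect_initial(e["word"]) and not e["russian_loan"]]

# The seven counterexamples listed in the text (hypothetical entry format).
ENTRIES = [
    {"word": "balalaittaag", "type": "onomatopoetic", "russian_loan": False},
    {"word": "bëëg", "type": "particle", "russian_loan": False},
    {"word": "borissag", "type": "onomatopoetic", "russian_loan": False},
    {"word": "bulissag", "type": "onomatopoetic", "russian_loan": False},
    {"word": "börö", "type": "content word", "russian_loan": False},
    {"word": "däädi", "type": "nursery word", "russian_loan": False},
    {"word": "filissaag", "type": "onomatopoetic", "russian_loan": False},
]

left = residue(ENTRIES)
by_initial = Counter(suspect_initial(e["word"]) for e in left)
by_type = Counter(e["type"] for e in left)
print(len(left), dict(by_initial), dict(by_type))
```

Run over a full digitized dictionary, the same two tallies would show at a glance how small the residue is next to the hundreds of inherited words with voiceless initials.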

A more detailed examination would find that medial voiced consonants other than /v/ can similarly be shown to be secondary: they occur as the consonant gradation alternants of the voiceless ones. Exceptions, as a rule, again occur only in Russian loans and probably some onomatopoeia. The full details would be more difficult to dig up though, so I am leaving this as an exercise for the interested reader. ;)

[1] In case anyone else is interested, some overflow stock of these from dunno where is still up for grabs at the University of Helsinki’s Dept. of Finno-Ugric Studies (Metsätalo/Unioninkatu 40, 4th floor).
[2] This may not be an entirely fair comparison, but… I have in mind the image of a “generative geologist” attempting to locate physical constraints present in gneiss or sediment that force its minerals to hold a macroscopically banded rather than homogenous structure.
[3] I will not dwell on /š/, also mainly a loanword phoneme.

Interplay of minor soundlaws: Samic glide clusters

Shifting and widening my scope a little, here’s a look into the history of two consonant clusters across the Samic languages as a whole.

The two-glide cluster *-jv- is a simple place to start. The development of this is straightforward: this is retained essentially intact everywhere across all Sami varieties. (If you want to have a look for yourself, I am including copious links to the Álgu database in this post.) Possibly the coda *j may have been vocalized into the 2nd component of a diphthong/triphthong, but this is basically trivial.

A small further complication still comes up in Southern Sami. Two words have here a seemingly irregular *-jj-: *oajvē > åejjie “head, end”; and *peajvē > biejjie “sun; day”.

Both of these happen to be inherited Uralic words, with cognates stretching all the way to Samoyedic. So my first reflex was to go “a ha! does this mean that the words showing /jv/ are therefore newer loanwords?” The answer is “no”, though: at least *koajvō- > gåajvodh “to dig” is of equally ancient pedigree. But I think I can dial this hypothesis back a little. Perhaps the shift *jv > *jj occurred due to the following front vowel *-ē (in Southern Sami characteristically diphthongized to /ie/ even in the 2nd syllable). This seems phonetically plausible & drops the number of counterexamples from half a dozen to one: *vājvē > vaejvie “pain”. This last word is in turn a known Finnish loanword, which may have indeed diffused into Southern Sami at a late date.

This idea seems to be preliminarily further supported by an interesting derivative of “head”: åajvadidh “to advise”. My chops in SS historical morphology are insufficient to present an implicit PS reconstruction, but we can clearly see here at least a retained stem vowel *ā, a regular feature before 3rd syllable *ë; in other positions this was further raised to *ē already in PS. And before this lower vowel, *-jv- survives after all.

Now let’s consider the opposite PS cluster *-vj-. This turns out to have had a much more complicated history.

Three Sami varieties have completely regular development. Lule Sami and Ter Sami have in all involved words metathesized this cluster, merging it with *-jv-. Inari Sami has always retained /vj/. Northern Sami also might belong here, depending on who you ask: the Álgu database claims /jv/ in a single word *sāvjë > sájva “isolated lake”, while my copy of Yhteissaamelainen sanasto presents sawˈjâ (= equivalent to sávja in the current NS orthography). I would guess that there are dialectal differences involved? FWIW, Sammallahti in The Saami Languages claims that at least the Torne Sami dialect group “originally” belonged with Lule Sami rather than Northern Sami. [1]

In a couple of other varieties, it is also possible to state a mostly applicable rule. Pite Sami aligns with its sibling Lule in having -jv- everywhere except in *jēvjë > jievja “white reindeer”; while Skolt and Kildin Sami align with Inari Sami in having -vj- everywhere except in *ćōvjë > Sk čuõivâk, K čuəivex “grey reindeer”. Probably these sorts of exceptions again represent loaning from neighboring dialects. [2]

Southern Sami again shows a few more complications; as does the neighboring Ume Sami. Covering SS first, metathesized /jv/ occurs in two words: *tāvjā > daajvaj “often”, *sāvjë > saajve “gnome”. Unmetathesized /vj/ is found in three: *jēvjë > joevje “light grey reindeer”, *jēvjë- > joevjeme “beard moss” (don’t ask me what’s the oe doing in these), *vōvjē > vuevjie “wedge”. Lastly, an assimilated /jj/ is found in *ćoavjē > tjåejjie “stomach”. This appears to confirm the assimilation rule I proposed in the 1st section: v > j / j_ie. Provided that we assume the metathesis *vj > *jv to have occurred before this…

The Ume Sami reflexes seem to support this last assumption. Although not many of the involved words have been recorded from here, /jv/ is found in those lexemes that have SS /jv/ ~ /jj/: dàivài “often” and tjåìvee “stomach” — while /vj/ is found in those that have SS /vj/: jauja “grey reindeer”, vyöyjee “wedge-shaped patch”. There is also one word with a somewhat baffling three-glide reflex: guyvjas “grey reindeer” (with unetymological /g-/ to boot). [3]

How should this distinction between S+U-metathesizing and S+U-unmetathesizing *-vj- be accounted for? Could this be etymological somehow? An interesting fact is that *vōvjē “wedge” is one of the Samic words showing lenition of original coda *k before sonorants (as shown by the Finnic cognates: e.g. modern Finnish vaaja, Karelian voakie, Livonian vaigā < PF *vakja). [4] So, perhaps this change occurred only after the metathesis of inherited *vj to *jv in Southern and Ume Sami? A late date for the change has already been suspected:

This sound change cannot be reliably dated, but it may well have taken place during a relatively late phase of Proto-Saami.

(Aikio 2006: 3.11 §) [5]

With this interpretation, a “maximally hereditary” chronology would be:

  1. Lenition *kj > *ɣj in Finnish.
  2. Samic *tāvjā “frequent” is loaned from Finnish *taɣja.
  3. Metathesis *vj > *jv in South & Ume.
  4. Lenition *kj > *vj all across Samic.
  5. Samic *jēvjë “white reindeer” is loaned from Germanic.
  6. Assimilation *v > j / j_ie in South. — Metathesis *vj > jv in Pite, Lule & Ter. — Raising *eu > *iu in Germanic.
  7. Samic *vājvē “pain” is loaned from Finnish vaiva.
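
For the Southern Sami part of this chronology, the ordering logic can be checked mechanically. The sketch below is a toy model and not a full historical phonology: it applies only steps 3, 4 and 6 as plain string replacements, each word carries the number of the step at which it is assumed to enter the lexicon (0 for inherited words), and the pre-lenition form vōkjē for “wedge” is a hypothetical stand-in, since the text only gives the post-lenition reconstruction.

```python
# Southern Sami subset of the chronology: (step number, description, change).
RULES = [
    (3, "metathesis *vj > *jv", lambda w: w.replace("vj", "jv")),
    (4, "lenition *kj > *vj", lambda w: w.replace("kj", "vj")),
    (6, "assimilation *v > j / j_ē", lambda w: w.replace("jvē", "jjē")),
]

# gloss: (form on entering the lexicon, step at which it enters; 0 = inherited)
WORDS = {
    "head": ("oajvē", 0),
    "to dig": ("koajvō", 0),
    "stomach": ("ćoavjē", 0),
    "wedge": ("vōkjē", 0),          # hypothetical pre-lenition form
    "often": ("tāvjā", 2),          # loaned at step 2
    "white reindeer": ("jēvjë", 5),  # loaned at step 5
    "pain": ("vājvē", 7),           # loaned at step 7
}

def develop(form, entered):
    """Apply every rule that is ordered later than the word's entry step."""
    for step, _description, change in RULES:
        if step > entered:
            form = change(form)
    return form

for gloss, (form, entered) in WORDS.items():
    print(f"{gloss:15} *{form} > {develop(form, entered)}")
```

The point of the exercise is the dependence on ordering: “pain” keeps its *jv only because it enters after step 6, and “wedge” keeps *vj only because its cluster arises after step 3.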

…But is it a good idea to attempt maximizing the degree to which various Samic words would have been inherited from a common ancestor? I think it is important to keep in mind that fresh loanwords readily diffuse across dialect continua.

As for the particular downsides of the above scenario, at minimum I am uncomfortable assuming that the specifically Finnish change *kj > *ɣj occurred earlier than the supposedly Proto-Germanic change *eu > *iu / _j. [6] OK, it’d be possible to go on making some cleanup assumptions; e.g. that in the numerous newer Germanic loans in Finnish where *ɣj can be reconstructed, this was substituted for original *kj; or perhaps, that the /k/ ~ /g/ found in the other Finnic languages would be a reversal from *ɣ; but this would all be for no other reason than ensuring a Proto-Samic ancestry for SS daajvaj, US dàivài. We could instead assume that S+U acquired these words from the direction of P+L, and show /jv/ for this reason.

This should also call into question whether my step 3 above existed at all. *sāvjë “gnome” (elsewhere in Samic also with meanings like “underground water”, “lake with an underworld entrance”, “isolated lake”) seems like a potential cultural loan from the P+L direction at least. It is of Germanic ultimate origin, but seems to have acquired its mythical flavor only on the Sami side: the PGmc root is simply *saiwiz “lake”.

Note moreover that this loan etymology actually predicts PS *-jv-, not *-vj-! And yet there is no evidence for the inverse metathesis *-jv- > *-vj- to have regularly occurred in any Samic variety. So are we therefore forced to furthermore conclude that this word was originally adopted specifically in the Pite/Lule area, and hypercorrectly metathesized to *-vj- when loaned eastward from these varieties? The Southern /jv/ could similarly also turn out to be original after all.

This leaves just the question of *ćoavjē “stomach”. Relationship to Samoyedic *t¹äjwə “stomach” has been proposed. The initial consonant, vowel frontness, and glide cluster order all fail to match, though, so I suspect this is only an accidental resemblance. I could just as well propose that the Samic word is a metathesis from something like earlier *voaćjē, and therefore related to Finnic *vacca “stomach”? (Ha ha.) With the case for inheritance being in this shape, I don’t think it would be too much of a problem to assume that here, too, the S+U forms have been loaned from the direction of P+L. — But still early enough to have participated in cluster smoothing in SS, apparently.

An additional topic to ponder at this point would be the motivation of the metathesis *-vj- > *-jv-, which altogether appears to be attested in at least two widely separated parts of the Samic dialect continuum. Pite and Lule Sami are spoken in northern Sweden and adjacent areas of Norway (also Finland if we count Torne Sami), Ter Sami at the eastern end of the Kola peninsula. It seems unlikely that these groups have been in any direct contact with each other since Proto-Samic times. It also seems unlikely that this incredibly specific metathesis was purely coincidentally innovated in both. One possibility might be some kind of a phonological precondition for this change having existed already in Proto-Samic, which in only two areas led to the change running to completion?

A better solution though might be a common external source. This exact same metathesis happens to be known furthermore from the Finnic languages! Late Proto-Finnic allows no *-vj- (or *-Vuj-/*-Vüj-: we are better off reconstructing diphthongs rather than coda glides at this date), and although no words with PU *-wj- have been retained in Finnic, a number of loanwords allow reconstructing a metathesis here. E.g. PGmc *flauja- → Finnish laiva “ship”. [7] Metatheses of some other similar clusters including older *-wr- (PS *jāvrē ~ LPF *järvi “lake”) are also found, which suggests that this type of change originated in Finnic, and might have been in the case of *-vj- > *-jv- passed on to Samic.

Still, why just these specific varieties? The Lule Sami probably had numerous connections with Finnic traders and settlers in the Torne Valley and adjacent areas since a much older period than the Finnmark/Inari/Skolt/Kildin Sami living further inland, that much is clear. Yet should we expect this shift to have therefore also been present in the extinct “southeastern” Sami varieties such as the marginally attested Kemi Sami?

Particularly difficult to understand is Ter Sami. I do not think we even know at present whether the Kola Sami languages developed entirely in situ, or if they may have spread to Kola from e.g. the southern reaches of the White Sea, some of their characteristic features already in tow? The presence of this sound change might demand, at minimum, for Ter to descend from a dialect that was originally spoken further south than the corresponding ancestral dialect of Kildin…

[1] One wonders how and why may we claim that it no longer does; or whether we are to conclude that “Northern Sami” is an areal entity rather than a genealogical one.
[2] I wonder if these last two words have some relation to each other. The semantic closeness is obvious, and the consonant skeletons are quite similar as well. The proposed etymology for *jēvjë is loaning from (pre-?)Proto-Germanic *heuja- “hue”, and the Germanic *h- moreover comes from PIE *ḱ-. Wiktionary mentions here e.g. Lithuanian šývas “white”. Could any of the Satemic cognates have plausibly been loaned to yield pre-Samic *ćawjə or *ćowjə “grey”?
[3] Or could this indicate a substitution *ḱ- > *k-, from some non-Satem variety? Perhaps not, since this would be chronologically problematic and there are other known examples of irregular *ć- > *k- in some varieties of Sami.
[4] Also by the word’s etymology as stemming from Baltic: cf. Lithuanian vagis, Latvian vadzis. For more details cf. Itkonen, Terho (1982): Laaja, lavea, lakea ja laakea. In: Virittäjä 86.
[5] Aikio, Ante (2006): On Germanic-Saami contacts and Saami prehistory. In: SUSA 91.
[6] I actually suspect this was “only” Northwest Germanic, given how Gothic shifts *e to *i always anyway. More details to come on this point later though. At any rate this would still not be a huge chronological relief.
[7] For further details cf. Koivulehto, Jorma (1970): Suomen laiva-sanasta. In: Virittäjä 74.

Notes on Eastern Sami vowel history, part 2

(← Part 1)

For initial details: a few complications involving *i and *ë.

In the Kola Sami branch (Kildin & Ter Sami), the default reflex of PS *i seems to be /ï/. (I dunno if this is [ɨ] or [ɯ], though I’m relatively sure that this isn’t really relevant.) E.g.:

  • *ćimpē > K čï´mmb, T čïḿḿb́e “shin”
  • *ijë > K ïjj, T jïjj “night”
  • *kikë- > K kïggeð, T kïkkɐd “to rut”
  • *kirtē- > K kï´rrdeð, T kïŕŕďed “to fly”
  • *nisōn > K T nïzan “woman”
  • *piksë > K pïxxs, T pïkks “bird’s sternum”
  • *pirë > K T pïrr “around”
  • *rissē > K rï´ss, T rïśśe “twig”
  • *silpë > K T sïllb “silver”
  • *tikkē > K tï´kk, T tïx́x́ḱe “tick”
  • *vitë > K vïdd, T vïtt “five”

There seems to be one regular exception to this development: /i/ remains after *ń-. The cases are:

  • *ńiŋēlës > K ńiŋŋlȧs “female”. Lehtiranta reconstructs this with initial *n-, but variation might go back to Proto-Samic; Pite, Northern and Skolt Sami also have /ń-/, while Ume, Lule and Inari Sami have /n-/.
  • *ńińćē > K ńi´ńńdž, T ńińńdže “teat”
  • *ńipćōs > K ńipčas, T ńipčs “roasting spit”

Not a whole lot, but this makes good phonetical sense, and there seem to be no counterexamples.

Another environment where /i/ comes up frequently is before coda *j. However, this is not fully regular, and given that PS *-ijC- only occurs in loanwords from Finnic & Scandinavian, analyzing these as post-Proto-Samic loanwords adopted after the change *i > *ï seems preferable:

  • K ki´jjteð, T kijjtad “to thank” (← Finnic *kiittä-)
  • K li´jjg “excess” (← Finnic *liika)
  • K ni´jjb, T nijjb́e “knife” (← Scand.)
  • K T sijjd “village” (← Scand.)

The expected /ïj/ is still found in three words (and also cf. *ijë “night” above):

  • *lijnē > K lï´jjn, T lïjjńe “linen”
  • *rijtō > K rïjjd “quarrel”
  • *tijmā > K T tïjjma “last year”

So far, so good. But let’s kick it up a notch. PS *ë, as I mentioned before, has a variety of differing reflexes across the Eastern Samic varieties. All of the main ones are some flavor of open-to-mid, back-to-central. For Kola Sami, a representative selection would be:

  • *mënë- > K mëënneð, T mɐnnɐd “to go”
  • *nënōs > K nȧnas, T nɐnas “strong”
  • *tënē > K tȧ´nn, T tɐńńe “tin”

After PS *ń- and *j-, though, several words point to *i across Eastern Samic. Lehtiranta lists 7 roots beginning with the sequence *jë-, and 10 with *ńë-, that are found in ES. 8 of these 17 have *i-like reflexes in at least one ES variety. This does not seem like a coincidence: similar cases in other consonant environments, including after *ć-, are very rare.

The regular cases are:

  • *jëlkëtē > Inari jolgad, Skolt jõlggâd “flat”
  • *jëllë > I jolla, Sk jõll “crazy”
  • *jëlŋēs > I jalŋes, Sk jââ´lnjes, K jȧ´lŋes, T jɐĺĺŋ́eś “tree stump”
  • *jërŋë > I jorŋa, Sk jõrŋŋ, K jëërn “open water”
  • *jëskë(tē) > I joska, Sk jõskk, K jëëskeð, T jɐsskɐd “quiet”
  • *ńëðē- > I njađđeeđ, Sk njââ´đđed, K ńȧ´ddeð, T ńɐťťed “to affix together”
  • *ńëlë- > I njoollađ, Sk njõõllâd, K ńëlleð “to debark a tree”
  • *ńël-tē- > I njaldeđ, Sk njâ´ldded, K ńȧlldeð “to peel” (a derivative of the previous)
  • *ńëvē > I njauve, Sk njââ´vv, T ńɐv́v́e “rapids”

The seemingly irregular cases are:

  • *jëkē > I ihe, Sk ee´ǩǩ, K ï´gg, T jïḱḱe “year”
  • *jëŋë- > I iiŋŋađ, Sk iiŋŋâd, K ïŋŋeð “to dry”
  • *ńëckē- > I njiskođ; — but Sk njõõcksed, K ńȧ´ckseð “to scrape (off)”
  • *ńëkē- > I njihe-, Sk njee´ǩǩ-, K ńï´gg-, T ńïkke- “slanted”
  • *ńëkkē(ńë)- > I njihanjas, Sk njikknâsted, K ńiggnȧ´steð, T ńïx́x́ḱed “to hiccup”
  • *ńëmë- > Sk njiimmâd, K ńïmmeð, T ńïmmɐd; — but I njommađ “to suck”
  • *ńëncē- > I njiʒʒed, Sk nje´ʒʒed; — but K ńȧ´nndzeð “to rip off”
  • *ńëvlē > I njivle, Sk njeu´ll, K ńi´vvl “slime”

There are actually some hints for the conditioning of the split here. After *ń, *ë mostly remains low in original open syllables, but is reflected as more close/front in original closed syllables. Vowel length in Skolt seems like an even better indicator that allows also understanding “to suck”. Hence it seems that this change is related to the secondary vowel lengthening that I mentioned last time: only short *ë is palatalized to *i, while lengthened *ë remains. The lack of raising in *ńël-tē-, then, might be due to the derivational relationship to *ńëlë-.

Bizarrely, the situation seems to be the inverse for *jë-: going again per Skolt, the lengthened cases are raised/fronted, while the short cases remain.

Furthermore, the interaction of this phenomenon with the previous one does something weird: the change *ńi > /ńi/ fails to occur in several Kola Sami words in this 2nd group (“slanted”, “to suck”, partially “hiccup”)! This is quite mysterious. To route these words in as regular developments, we’d have to assume that *ńë > *ńi only happened after *ńi > /ńi/, but also before the change *i > /ï/. That is, *ńi > /ńi/ would not represent a simple absence of sound change, but instead some sort of a shunt to a different vowel altogether, later booted back to regular /i/?!

Perhaps these are actually cases of etymological hypercorrection. Suppose that the words with /ńï/ are not actually inherited in Kola Sami, but were loaned from Skolt (or some other eastern but non-Kola variety)? If so, the speakers could have latched on to the usual pattern /i/ : /ï/ and overgeneralized here. Several examples are known of this kind of process even between Samic and Finnic; why not also between the individual Sami varieties?

 The phonetic nature of the change *ë > *i / {j, ń}_ is interesting too. PS *ë is generally taken to derive from earlier *i (< PU *i, *ü), while at this timeframe *i would have been a long vowel *ī. In this light, I wonder if the palatal assimilation was actually sufficiently early that *ë was still hanging somewhere around the front vowel region, e.g. as [ɪ], and the change amounted to simple raising? If the assimilation operated on an already retracted vowel like [ə] or [ɤ] or [ʌ], I’d rather expect the result to have been a mid front vowel like [e]. — OTOH the Samic languages do seem to be somewhat “allergic” to this sound, so raising all the way back to /i/ does not seem entirely out of the question.

Notes on Eastern Sami vowel history, part 1

Recently I sat down with my copy of J. Lehtiranta’s Proto-Samic dictionary, Yhteissaamelainen sanasto (1989; SUST 200) to work out the development of the vowel systems in the Eastern Samic languages. I do not know if this has been done before; it might have, though I am not exactly worried about rederiving results. [1] At minimum this topic is absent from Sammallahti’s handbook The Saami Languages (1998). His historical phonology appendix covers at length only the evolution from Proto-Uralic to Proto-Samic, and from there to Northern Sami. In the main chapters, too, he only mentions a handful of innovations for Eastern Samic (that he deems diagnostic for defining its taxonomy). Yet it’s obvious that there’s been much divergence going on here: cf. e.g. *kōlē “fish” > Inari Sami kyeli, Skolt Sami kue´ll, Kildin Sami kū´ll, Ter Sami kïĺĺe. [2]

The following fairly general features stand out:

  • The umlaut tendencies that already must have begun in the Proto-Samic era have continued wildly. Most vowels have distinct reflexes before each of the three common PS stem vowels: *-ë, *-ē, *-ō. (Since Lehtiranta only lists citation forms of words, I don’t have much idea what the effect of PS *-ā, *-i, and *-u, which are rare outside of inflected forms, has been.) As usual, this must’ve been allophonic at first, but was later widely phonemicized by loss of unstressed vowels.
  • The mid vowels *ē, *ea, *oa, *ō, *o, *ë have the most varied reflexes. The close *i, *u are mostly unaffected (only Skolt has any umlauts going on with these), and *ā has not been majorly affected either.
  • Although not all languages distinguish all different umlaut “grades” of various PS vowels, I suspect umlauts for the most part regardless occurred in Proto-East Samic already, and that various languages have simply secondarily lost certain distinctions — since they seem to have done so in different ways. E.g. in Inari, *ë-ë and *ë-ō both > /o/, versus *ë-ē > /a/; but in Ter, instead *ë-ë and *ë-ē both > /ɐ/, versus *ë-ō > /o/. Skolt and Kildin distinguish all three types.
  • In addition to the umlaut splits, there also seems to be a vowel length split. There is of course no sign of this in Ter, where vowel length contrasts have been lost altogether; but it’s found relatively robustly in the other three languages. This seems regardless a little bit more like an areal phenomenon: lengthening in Inari almost always implies lengthening in Skolt, but Kildin corresponds more poorly to these, and there are also cases where lengthening is found only in Skolt. As for conditioning, long vowels seem to be the rule of thumb before singleton medials, short vowels more general before two-stop clusters. This includes geminates, so the change must have been earlier than the strengthening of the strong grade of single stop consonants to geminates in Skolt. I’ve not worked out the conditions for other consonant clusters yet.
  • Skolt Sami seems to be altogether the Sami variety with the most complicated vocalism (though Southern Sami could give it a good run for the title). At its best, *ea has no less than seven different reflexes: eä, iä, iâ, iõ, ie, e, ee!
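
The argument about Inari and Ter merging different umlaut grades can be made explicit with a small comparison. This is a minimal sketch using only the four reflexes of *ë quoted in the list above; the context labels and function names are my own illustrative choices.

```python
# Reflexes of PS *ë before each stem vowel, as quoted in the list above.
REFLEXES = {
    "Inari": {"ë-ë": "o", "ë-ē": "a", "ë-ō": "o"},
    "Ter": {"ë-ë": "ɐ", "ë-ē": "ɐ", "ë-ō": "o"},
}

def merged_classes(reflexes):
    """Group the umlaut contexts that share a reflex within one language."""
    groups = {}
    for context, vowel in reflexes.items():
        groups.setdefault(vowel, set()).add(context)
    return {frozenset(group) for group in groups.values()}

def joint_classes(*languages):
    """Group contexts that no language in the comparison keeps apart."""
    groups = {}
    for context in languages[0]:
        key = tuple(language[context] for language in languages)
        groups.setdefault(key, set()).add(context)
    return {frozenset(group) for group in groups.values()}

inari = merged_classes(REFLEXES["Inari"])
ter = merged_classes(REFLEXES["Ter"])
joint = joint_classes(REFLEXES["Inari"], REFLEXES["Ter"])
print(len(inari), len(ter), len(joint))
```

Each language on its own shows two classes, but the two partitions cut across each other, so the pair together still requires three distinct grades. This is the pattern expected if a common three-way split was later simplified independently.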

I do have a full correspondence table charted out, but further details shall come later once I’m done double-checking things.

All this has clearly had one important effect, though: loanwords seem to frequently “fail to keep up” with all the hair-thin split rules going on. Generally such cases seem to remain phonetically closer to the loaning language. It follows that such loans have to be dated as newer than Proto-Samic; indeed, possibly as newer than the splitting of all dialects in question. Even then, many such loanwords show a distribution across nearly all of the Samic languages. This seems to be another good demonstration of a point I think Uralic etymology needs to pay a lot more attention to: the “distributional principle” (“a word dates to the common ancestor of the languages it is found in”) cannot be trusted in the case of loanwords.

— There’s also one interesting feature that suggests some reinterpretation of the Proto-Samic vowel system. The *-ē-grade reflexes generally seem to be somewhat fronted, when distinct from the “unmarked” *-ë-grade reflexes (cf. e.g. “fish” above). On the other hand, *-ō has had a fairly general lowering effect, not so much a labializing one. This is only natural insofar as *-ō merges with *-ā in Skolt thru Ter. But it does remain a labial vowel in Inari. So what’s up with changes such as *pēŋkë > piegga “wind” vs. *pērkō > piärgu “food”; *mōrë > muora “tree” vs. *mōlōs > muálus “thawed water at shore”? And for that matter, Sammallahti notes that *ō caused also earlier lowering of PU *e, *o to PS *ea, *oa; [3] he posits a relatively open value [ɔː] for the vowel for this reason.

I now have formulated a different hypothesis. The etymological origin of *-ō is unclear — but most proposals have involved a coloring of PU *-a in some fashion. However! If there was indeed a change *-aw > *-o, perhaps this should be postdated to the dialectal Sami era. The following chronology seems to have potential:

  1. Late Proto-Samic: 2nd syllable *-a > *-ā generally changes to PS *-ē, but remains in PS stems of the shape *-āw.
  2. After the W/E split: Secondary *ā-umlaut in Eastern Sami.
  3. After further dialectification: *āw coalesces to *ō in Western + Inari Sami; but merges with *ā in Skolt + Kola Sami.

Of course, this would require looking into the consequences. One issue is that Proto-Samic had not only the traditional *-ō-stems, but also the class of *-ōj-stems. How should these be reconstructed in this system? I don’t think anything with front rounded glides (*-āẅ?!) would work, since PS had eliminated front rounded vowels from its phonology. Maybe *-āwjV?

 Followups: Part 2

[1] If anything, I consider this a much better way to get a hang of known results than just reading about them from a reference book. Also, I did this kind of a survey on Livonian once before and that ended up with me making a couple of discoveries that have by now grown to a draft paper.
[2] Or, supposedly, with “light” palatalization (UPA subscript half ring), not “heavy” (UPA superscript acute). I’ve seen the similar contrast of “palatal prosody” vs “segmental palatalization” in Skolt Sami transcribed as one of secondary palatalization [tsʲ sʲ nʲ lʲ] vs. full palatality [tɕ ɕ ɲ ʎ] though, and given how the UPA is surprisingly terrible at representing primary palatals, I’m guessing this is the case for Ter Sami as well. Especially since both languages lack a “heavily palatalized ŕ” where expected, which squares well with how palatal trills are physiologically impossible.
[3] Actually [ɛː], [ɔː] according to him. I have a couple reasons to think these may have diphthongized already early on, though; but that’s ever so slightly off-topic for this post…

Proto-Yukaghir voiced stops (and their implications)

One of the more popular proposals for external relationships of the Uralic family is the Uralo-Yukaghir hypothesis. By certain measures it might even count as the most popular one. The idea has been around for a long while, but in an infuriatingly entrenched state, with views divided between mainstream specialists dismissing everything as speculation, vs. macro-comparativists and several outsiders taking the relationship as more or less granted. [1] E.g. from the humbler and more “professionally credible” end of the latter group, consider Michael Fortescue’s 1998 monograph Language Relations Across Bering Strait: the book makes no attempt to explore the possibility of any Uralic/Yukaghir similarities resulting from anything but genetic inheritance. This is a particularly jarring omission since he does still cover other contact influences relevant to his idea of relating Uralic, Yukaghir, Chukotko-Kamchatkan and Eskimo-Aleut: those between Y + CK, CK + EA, and even between the individual branches of CK and EA.

Research into the hypothesis seems to be finally picking up these days, though. Much of this must have been enabled by Elena Nikolayeva’s ongoing work on the Yukaghir side, culminating in her 2006 monograph, A Historical Dictionary of Yukaghir. After an apparent latency period of diffusion and digestion, a bunch of new views on U/Y relations have emerged here in Finland within the last few years in particular:

  • Häkkinen, Jaakko (2012): Early contacts between Uralic and Yukaghir. [Appendix.] In: SUST 264.
    — An attempt to model lexical correspondences as several strata of loanwords, and to determine what this would imply for Uralic and Yukaghir prehistory in geographical and archeological terms.
  • Piispanen, Peter S. (2013): The Uralic-Yukaghiric connection revisited: Sound Correspondences of Geminate Clusters. In: SUSA 94.
    — A more optimistic take, presuming a relationship and suggesting some new lexical comparisons requiring rather wild new soundlaws.
  • Luobbal Sámmol Sámmol Ante (Ante Aikio): The Uralic-Yukaghir lexical correspondences: genetic inheritance, language contact or chance resemblance? [Preprint.] To appear in: FUF 62.
    — A detailed, conservative review, suggesting that the currently known material is too scarce to establish regular sound correspondences, and that therefore many lexical comparisons may turn out to be simply accidental similarities.

According to the word on the grapevine, there is also at least one further paper in the works on the topic.

I have yet to subscribe to any particular hypothesis on the topic (though of course a burden of proof should lie on those claiming a particularly close U/Y relationship). But it seems to me any assessment of the situation is going to strongly depend on our general understanding of Uralic and Yukaghir prehistory. One of the aims of my various ongoing work on Proto-Uralic is indeed to allow better assessing the various external relationships that have been proposed. I present here one proposal for amending Proto-Yukaghir as well.

The presence of voiced spirant consonants (at minimum *ð, *ɣ) has been listed by Fortescue as one of the better phonological markers of his “Uralo-Siberian” group of language families. The phonetic character of at least the Proto-Uralic “spirants” is however anything but clear… And on closer examination, I believe that for Proto-Yukaghir they’re probably a mistaken assumption.

The modern Yukaghir languages — Kolyma Yukaghir and Tundra Yukaghir — do not have any systematic series of voiced spirants. These only show up in Proto-Yukaghir as reconstructed by Nikolayeva. She posits PY word-medial *w, *ð, *ɣ [2] behind the following three sound correspondences:

  • Kolyma /b/ ~ Tundra /w/
  • Kolyma /d/ ~ Tundra /r/
  • Kolyma /g, ʁ/ ~ Tundra /g, ʁ/ (depending on the PY vowel backness)

This is not an immediately obvious reconstruction. Several changes are required here to derive the modern sound values: across-the-line spirant fortition in Kolyma, rhotacism of *ð + sporadic fortition of *ɣ in Tundra. It seems to me it would be more parsimonious to reconstruct here PY voiced stops *b, *d, *g (~ [ʁ]), and to assume only the lenition of *b and *d in Tundra. Note also that the change *d > *r can easily occur directly, without any intermediate *ð stage.
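
The parsimony comparison can be put as a crude tally: for each proposed proto-segment, count the daughter reflexes that differ from it. This is only a rough sketch under simplifying assumptions: the third correspondence is reduced to plain /g/ (ignoring the backness-conditioned /ʁ/), every mismatch is counted as one change regardless of how natural it is, and the dictionary keys and function name are illustrative.

```python
# The three medial correspondences as (Kolyma, Tundra) pairs,
# with /g, ʁ/ simplified to plain /g/.
CORRESPONDENCES = [("b", "w"), ("d", "r"), ("g", "g")]

# Proto-segments under the two competing reconstructions.
PROPOSALS = {
    "spirants": ["w", "ð", "ɣ"],  # Nikolayeva's reconstruction
    "stops": ["b", "d", "g"],     # the alternative argued for here
}

def changes_needed(proto_segments):
    """List every daughter reflex that differs from its proto-segment."""
    changes = []
    for proto, (kolyma, tundra) in zip(proto_segments, CORRESPONDENCES):
        if proto != kolyma:
            changes.append(f"*{proto} > {kolyma} (Kolyma)")
        if proto != tundra:
            changes.append(f"*{proto} > {tundra} (Tundra)")
    return changes

for name, segments in PROPOSALS.items():
    needed = changes_needed(segments)
    print(f"{name}: {len(needed)} changes: {', '.join(needed)}")
```

A segment-by-segment count like this says nothing about phonetic plausibility, which still has to be argued separately, as in the remark on *d > *r above.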

*w is reconstructed also word-initially for Proto-Yukaghir: again reflected as Tundra /w/, but instead lost in Kolyma. This is an odd asymmetry. Normally, glide or spirant fortition is more likely to occur word-initially (cf. Spanish and Selkup). [3] On the other hand, *b is not a consonant that is commonly lost word-initially, so reconstructing it here as well would not help. I suggest accepting the asymmetry instead of trying to explain it away: reconstructing initial *w- but medial *-b-. This state of affairs still technically allows identifying these two as the same proto-phoneme, which would provide a motivation for my newly assumed shift *b > /w/ in Tundra (and yet not *g > ˣ/ɣ/, which is a more common 1st step in voiced stop lenition chainshifts).

Perhaps there was also an earlier original word-internal *-w-, which was vocalized/lost in all attested Yukaghir varieties; either already in Proto-Yukaghir, or even slightly later on, in which case it might explain some of the numerous irregular vowel correspondences between Tundra and Kolyma.

The history of PY consonant clusters can furthermore be streamlined here. Nikolayeva sets up a set of nasal + voiceless stop clusters such as *mt, *ŋć, *ŋk, and has to assume later voicing to yield the actually attested /md/, /ŋď/, /ŋg/, etc. However, if voiced stops and not spirants are posited for PY, they can easily be reconstructed here as well. Nikolayeva also reconstructs liquid + stop clusters, and notes that the stops “mostly” remain unvoiced in these; yet with some exceptions. It seems these “exceptions”, that correlate neatly between Tundra and Kolyma, could have been in place already in Proto-Yukaghir.

The overall phonotactic pattern here — voiced stops that are restricted to word-medial positions and only contrast with voiceless stops between vowels (and, perhaps, after liquids?) — still suggests that some pre-Yukaghir stage only had voiceless stops; which were then voiced in some medial positions; followed by the introduction of new medial voiceless stops from some secondary source (e.g. geminate voiceless stops, loanwords). Some variation of this history has occurred widely among the Uralic languages, for one. But this is no reason to assume that the change is recent! Dialects of Mokša and Mari have resisted initial voiced stops in loanwords until fairly modern times (18th-20th century), despite medial voiced stops having existed already in Proto-Mordvinic and Proto-Mari times (somewhere around the 1st millennium CE).

Lexical correspondences with the Uralic languages also appear to support this model. I will refer here to Proto-Yukaghir roots by their index numbers in the Historical Dictionary, following Aikio’s paper linked above (it includes a useful appendix of Nikolayeva’s U/Y comparisons).

Considering the labial consonants other than *m, three recurring patterns involving these seem to be attested:

  • PU *w ~ PY ∅ (#620, “tree” ~ “birch”; #1112, “vapor” ~ “smoke”; ? #2050, “to hear” ~ “sound”)
  • PU *(m)p ~ PY *w (#139, “older sister”; #1048, “warm”)
  • PU *pp ~ PY *p (#362, “sharp”; #1038, “to tear”; #2150, “to hit”)

Medial *-w-, *-p-, *-pp- are actually fairly rare in PU, so even though some of the Uralic roots involved here are uncertain and there are some semantic differences, I find this a not quite trivial tally.
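The tally above can be kept as a small table and counted mechanically, which makes it easy to see how much it depends on the one comparison marked as uncertain. This is a minimal sketch: the index numbers and segments are copied from the list above (in Nikolayeva's notation, so PY *w rather than *b), while the tuple layout and the `tally` helper are assumptions of this example.

```python
from collections import Counter

# (PU segment, PY segment, Historical Dictionary index, marked uncertain?)
COMPARISONS = [
    ("w", "∅", 620, False),
    ("w", "∅", 1112, False),
    ("w", "∅", 2050, True),   # listed with a question mark in the text
    ("(m)p", "w", 139, False),
    ("(m)p", "w", 1048, False),
    ("pp", "p", 362, False),
    ("pp", "p", 1038, False),
    ("pp", "p", 2150, False),
]


def tally(comparisons, include_uncertain=True):
    """Count the comparisons per (PU segment, PY segment) pattern."""
    return Counter(
        (pu, py)
        for pu, py, _index, uncertain in comparisons
        if include_uncertain or not uncertain
    )


for pattern, count in tally(COMPARISONS).items():
    print(pattern, count)
```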

The correspondence *w ~ *w also seems to be absent (#806 “to leave” is a clearly rejectable comparison since the supposed “Uralic” root is a Germanic loan). While the material is scarce and so this could be an accidental gap, it regardless seems preferable to interpret the material as reflecting the following developments:

  • (pre-)PU *w → pre-Y *w > PY ∅
  • (pre-)PU *(m)p → PY *b (voiced either in pre-Yukaghir or in some loaning Uralic branch)
  • (pre-)PU *pp → PY *p (shortened either in pre-Y or in some loaning Uralic branch)

…which also implies that we should indeed not expect any examples of the correspondence *-w- ~ *-b- to turn up. [4]
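The implication can be checked mechanically by writing the three developments as a lookup from a (pre-)PU medial labial to its expected PY reflex. This is a sketch only: *(m)p is expanded into separate "p" and "mp" entries, loss is encoded as an empty string, and the names `DEVELOPMENTS`, `is_expected` and `sources_of` are hypothetical.

```python
# Proposed medial labial developments: (pre-)PU source -> PY reflex.
# The empty string stands for loss (∅).
DEVELOPMENTS = {
    "w": "",
    "p": "b",
    "mp": "b",
    "pp": "p",
}


def is_expected(pu_segment, py_segment):
    """True if the pair matches one of the proposed developments."""
    return DEVELOPMENTS.get(pu_segment) == py_segment


def sources_of(py_segment):
    """All (pre-)PU segments that would yield the given PY reflex."""
    return sorted(pu for pu, py in DEVELOPMENTS.items() if py == py_segment)


print(is_expected("w", "b"))  # the correspondence *-w- ~ *-b-
print(sources_of("b"))        # which PU segments can lie behind PY *b
```

Under these rules PY *b can only continue *p or *mp, so a PU *-w- ~ PY *-b- pair would have to come from some other source, such as the Samoyedic development mentioned in note [4].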

This does not seem to generalize to the other places of articulation, though. There indeed do not seem to be any recurring correspondences involving intervocalic dental obstruents (or even more suspiciously, any comparisons involving *-t- on either side [5]); and the only recurring intervocalic velar correspondence is PU *x ~ PY *g (#1480, “guard” ~ “hunt”; #2599, “lead, take”). There is also one example each of *k ~ *g (#1302, “hill(s)”) and of *w ~ *g (#1019, “to eat”). These bring to mind the East Uralic development of *-k-, *-w- to *-ɣ-, which seems to suggest that if these comparisons are correct, they probably represent loans rather than inheritance.

Additionally, I wonder if the current issue has partly also been an issue of terminology. Nikolayeva’s model of the history of Yukaghir includes not only the Proto-Yukaghir stage, but also an “Old Yukaghir” stage, which would already have e.g. featured voiced stops in clusters. This is mainly used as a cover term for early historical records prior to the mid-19th century, but perhaps her underlying mental model in full detail actually looks like this:

Proto-Yukaghir > Old Yukaghir > dialectified Old Yukaghir > modern Kolyma Yukaghir & Tundra Yukaghir

Under this scenario, the 1st “Old Y.” stage would be the actual last common ancestor of the recorded Yukaghir varieties, while “Proto-Y.” would be an internally reconstructed entity. It would not be the first time a historical linguist were to abuse terminology in this way.

This is not a random guess. There are a couple other hints for this interpretation, e.g. the treatment of long vowels. Nikolayeva does not reconstruct these in certain positions where they do not contrast with short vowels, even though they appear in all records. She assumes that they must hence be ultimately somehow secondary even in other positions. This does not necessarily follow: consider e.g. Modern English, where “vowel length” (well, tenseness) fails to be contrastive in open monosyllables, in most dialects also before /r/. Regardless of this, and even regardless of numerous reconstructible processes of compensatory lengthening (e.g. light /laɪt/ ~ German Licht /lɪçt/), the vowel length contrast in English is absolutely ancient: it can be traced back all the way to Proto-Indo-European!

(English incidentally and probably coincidentally works as a typological parallel also for my idea that medial *-w- could have been lost earlier on while initial *w- still remained.)

Finally, I can’t help noticing that the long vowel issue and the reconstruction of spirants rather than voiced stops both swerve “Proto-Y.” typologically closer to standard-issue Proto-Uralic. Is this perhaps not an accident, but rather a general bias that has resulted from Nikolayeva’s working hypothesis of a Uralo-Yukaghir relationship?

[1] Incidentally I find it an interesting question why this particular hypothetical relationship is so pervasively accepted by Nostraticists and the like. There is no shortage of competing proposals, such as Indo-Uralic or Uralo-Dravidian; and neither does Uralo-Yukaghir have a history of recognition by the general public, unlike e.g. the Ural-Altaic or Uralo-Sumerian hypotheses. Is it perhaps that the relative obscurity of Yukaghir has made it more difficult to notice weaknesses of the idea?
[2] Yes, I am aware that /w/ is a semivowel, not a spirant, though frequently it may pattern as one (or, perhaps better: “isolated” voiced spirants may pattern as dental/velar glides).
[3] Even more so for geminate glides actually, with some precedents being North Germanic + Gothic (*ww > *ggw, *jj > *ddj ~ *ggj); Northern Sami (*jj > /dj/); Votic (*jj > /ďď/); various Prakrits including Pāli (e.g. *vv > /bb/); and several Berber varieties (e.g. *ww > /ggʷ/). This doesn’t seem to come into question here, though.
[4] There is a development *w > *b in most Samoyedic languages that could allow this, but being post-Proto-Samoyedic (absent from Nenets and Selkup), this might have been too late to be relevant.
[5] This is particularly curious since PU *-t- has, by contrast, Indo-European correspondences in abundance. Any macrocomparativist model that proposed common ancestry for all three, or even just for Y+U, would be hard-pressed to explain why Yukaghir has lost such words so consistently.

Email works again

Whoops. I noticed that the email alias I had been using on my About page no longer works (and might not have worked for a while). I hope this has not led to too many lost messages. :/

