Notes on Janhunen’s Law

(Part ca. 3 of n in my irregularly scheduled series of Introducing Named Soundlaws in Uralic Studies. [0])

The issue, as I see it

Most of the vowel correspondences we now think to be regular between Samoyedic and the rest of Uralic are those that were outlined by Janhunen in 1981. The actual sound laws behind them have regardless often gotten re-tooled or re-dated by now, much in the same way that many of them already had earlier precedents in some form (primarily from Lehtisalo or Steinitz). E.g. the chainshift *e > *i, *ä > *e has been by now shown by Helimski to be post-Proto-Samoyedic, given Nganasan evidence for *e > †e > . On follow-up, also the reflexes of *ä > “*e” can be relatively open in some languages: Salminen (2012) has pointed this out about modern Forest Enets (e.g. *tät³tə > tät ‘4’), and to me it seems that e.g. the conditional developments *ä-a, *ä-å > *a in pre-Selkup likewise presume an open value for *ä. Cf. *ān-uj ‘true’ < PS *änå, or *kuəsə ‘iron’ < *wåsV < *wasV < PS *wäsa.

What I call “Janhunen’s Law” is, though, not any sound change in Samoyedic, but a proposal that he had in the same paper for an innovation in some uncertain number of western branches: PU *oCə > *uCə. Sammallahti (1988) indeed adopted it as an already Proto-Finno-Ugric innovation. Since then, though, there does not seem to have been much support for it, but then neither any critique nor other analysis.

On any kind of closer look, it does seem clear this cannot be quite as simple as Janhunen suggests. First of all, also a correspondence western *o ~ PS *o exists. Janhunen identifies two examples: *koj(-wV) ~ *koəj ‘birch’, *kopa ~ *kopå ‘bark’. This number can be increased: clear examples also include *koj(ə)ra ~ *korå ‘male animal’; *kokə- ~ *ko- ‘to check, see’ (all of these with *ko-, but this looks simply accidental; *ko- > *kå- can be also attested in e.g. *kåmpå ‘wave’, *kåsə- ‘to dry’, *kåət ‘spruce’). Possibly also *ńoxə- ~ *ńo- ‘to pursue, hunt’, though Janhunen assumes that Finnic *nouta- continues earlier *ńux-ta-, thru a similar lowering as in *sou-ta ‘to row’ ~ PS *tu- < PU *suxə-, and this does not look entirely impossible.

I’ve observed already long ago (first presented at the 2nd International Winter School of FU Studies in Szeged in 2014) that there seems to be evidence for further conditioning. First, all of Janhunen’s positive examples involve front consonants in the medial consonantism: alveolars and labials. Four cases are immediately unambiguous:

  • *lumə ~ *jom ‘snow’;
  • *kusə- ~ *kot- ‘to cough’;
  • *purə- ~ *por- ‘to bite’;
  • *tulə- ~ *toj- ‘to come’.

I would add first of all two cases that should be reconstructed with *-w- and not, as proposed by Janhunen, *-x-:

  • *śuwə ~ *śo(-j) ‘mouth, throat’; *-w- is clearly indicated by Southern Sami tjovve.
  • *tuwə ~ *to ‘lake’; *u reflected at least in Permic *ti̮. Original *-w- seems to be indicated by Northern Khanty *tŭw, Konda tŏw, and maybe the oddly front-vocalic təw in rest of Southern Khanty. [1]

Probably even a third is *luwə ~ *lë ‘bone’. *-w- is again indicated by Western Khanty forms — mostly rhyming with ‘lake’, e.g. Konda tŏw, other Southern təw, Nizyam tŭw, Kazym ɬŭw (but in Obdorsk lăw, versus tuw ‘lake’). Samoyedic *ë could indicate a shift *ëw > *ow in other languages already before *o-ə > *u-ə (a tentative Proto-Finno-Ugric innovation — though this seems a bit too trivial and devoid of parallels to be relied on for that).

One additional example that was not known to Janhunen shows a palatalized alveolar medial: *wuďə ‘new’ ~ *oj- > North Selkup oć-əŋ ‘again’, a neglected etymology from Helimski (1976). [2] Note further that positing *o > *u here explains the rare initial combination *wu-, not reconstructed anywhere else in Uralic vocabulary and probably phonotactically impossible in Proto-Uralic proper.

Looking beyond Samoyedic, it also seems to be the case that from the evidence of other languages, we cannot really reconstruct word roots of shapes like *CoPə, *CoTə, *CoRə. The best two contenders are *monə ‘many’, *wolə- ‘to be’, but the first is readily under doubt as being a loan from Indo-European (also Permic *-mi̮n, Mansi *-mān, Hungarian -vAn in names of decads do not particularly have to be related to ‘many’ in Finnic and Samic), and the latter looks more likely to have been *walə-. On the contrary, many reconstructions of the shape *CoKə have been already presented: at least *jokə ‘river’, *rokə- ‘to hack, cut’, *soŋə- ‘to enter’, *šokə- ‘to say’, *toxə- ‘to bring’; maybe also e.g. *poŋə ‘bosom’, *oŋə ‘hole’ (if not rather *poŋŋə, *aŋə). I take this also as grounds to suppose that there has indeed been a sound change *-oCə > *-uCə, for C ≠ velar.
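For concreteness, the conditioning proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name janhunen_raise, the treatment of every stem as a string with one symbol per segment, and the grouping of *x with the velars (following the *CoKə list above) are choices of the sketch, not anything taken from the literature. The inputs with *o are the pre-raising shapes that the law implies.

```python
import re

# Assumption of this sketch: *k, *ŋ and *x all count as velar blockers,
# following the *CoKə examples listed in the text.
VELARS = frozenset({"k", "ŋ", "x"})


def janhunen_raise(stem: str, blockers=VELARS) -> str:
    """Apply *o > *u in a (C)oCə stem unless the medial consonant blocks it.

    The leading asterisk and a trailing verb-stem hyphen are stripped,
    so the result is returned as a bare string of segments.
    """
    bare = stem.lstrip("*").rstrip("-")
    # Optional single-segment onset, then o, one medial consonant, final ə.
    match = re.fullmatch(r"([^oə]?)o(.)ə", bare)
    if match is None:
        return bare  # not a (C)oCə stem: leave untouched
    onset, medial = match.groups()
    if medial in blockers:
        return bare  # raising blocked by the medial consonant
    return f"{onset}u{medial}ə"


if __name__ == "__main__":
    for form in ("*lomə", "*porə-", "*tolə-", "*jokə", "*soŋə-", "*toxə-"):
        print(form, ">", janhunen_raise(form))
```

The blockers parameter is left open on purpose, so that other candidate blocking consonants can be tried out against the same word list.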

I suspect also palatal *-j- might have blocked raising: cf. *kojə ‘male’ (though this is mostly continued in derivatives like *koj-ma, *koj-ra). An interesting case on this front is ‘to swim’, usually reconstructed as *ujə- per Finnic (Finnish uida, Estonian ujuma etc.), but most cognates (clearly at least Samic *vōjë-, Mordvinic *uj-, Permic *uji̮-, SKhanty üj-) better point to *ojə-. As I’ve noted by now in a talk from 2018, even within Finnic, Livonian vȯigõ (? < *oi-kV-) seems to still retain *o. The reflex in Samoyedic, on the other hand, mysteriously enough, is still indeed *u- or *uj-.

An alternative view?

The only counterproposal in any clear detail that I’ve seen comes from Jaakko Häkkinen, first in his Master’s thesis and later, much more briefly, in his 2009 paper on locating Proto-Uralic. He suggests inverting Janhunen’s Law, to apply in Samoyedic and not outside of it: *CuCə > *Co(C). I have seen / heard something similar from other colleagues in a variety of discussions, but I do not recall any defense of this being published. At most, see some discussion in this blog’s comments starting here, with Ante Aikio listing some notes about *o ~ *u variation within Samoyedic and additional irregular-looking examples of *o. Among these I would doubt at least the reconstruction PS *počå- ‘soak, ooze’, though. This probably refers to the words appearing in UEW under *poča- ‘become wet’; but Nganasan and (with irregular b-) Kamassian seem to point rather to *påTå-, with evidence for *o limited to Nenets–Enets. Or, since (old) Nganasan fo- can continue not just *på- but also *pə-, and Enets has o < *ə regularly, another option, maybe better still, would be that this was *pəčå- in PS after all, as would be expected per the Udmurt, Khanty and Mansi cognates; and that the Nenets word is a loan from Enets, while the Kamassian word doesn’t belong here at all. (Donner’s original data actually has not just a voiced b but palatalized , which is also difficult to explain.) In some other examples I don’t see any particular reason to think that they point to secondary *u > *o rather than secondary *o > *u (thus so maybe in “*num” ‘heaven’) or to *o at all (thus so in Nganasan tui ‘fire’ for expected ˣtüi: this looks like unclear retention of *u, which has other parallels).

Anyway, the major problem that I see in the inverted approach is explaining where Proto-Samoyedic *Cu(C) then comes from. There is solid evidence at least for a rime *-uj:

  • *tuj ‘fire’ < PU *tulə (a minimal pair with *toj- ‘to come’!);
  • *uj ‘pole’ < PU *ul(k)ə;
  • *kuj ‘spoon’ < PU ? *kujə (cf. Finnish kuiri ~ kuiru ‘id.’; I am not committed either way on whether the proposed Komi and Ob-Ugric cognates meaning ‘trough ~ mortar’ belong);
  • *puj ‘eye of a needle, etc.’ < *pujə.

The last two probably show PU *-jə > ∅ and PS *j as some derivative suffix, [3] but this alone cannot explain *u rather than *o, since also the latter readily occurs in CV stems: *ko-, *ńo-, *to, *śo-j. A few PS roots also show *u: natively at least *tu- ‘to row’ < PU *suxə; of unknown origin, *ku- ‘cord’, *ju ‘warm’ [4]. Some other CVC examples can be found too, including *pur ‘smoke’ < PU *purkə; *ut ‘road’ < PU ? *uktə. But at least these two examples we might argue to be irrelevant due to continuing PU *u in an original closed syllable, just with exceptional loss of *-ə after some probably very early cluster simplifications.

As comes to the lack of PS roots of shapes such as **Cup, **Cun, **Cuŋ, this could indicate that something happened to such cases, but it doesn’t follow that the result must have been *o. Other options would readily include reduction to *ə, already suggested by Janhunen in e.g. *təŋ ‘summer’ < PU *suŋə.

Future hypotheses

So far I do side with the hypothesis that Janhunen’s Law is a real phenomenon. Its exact extent and conditions seem to require review, however. I have some reasons to suspect that PU *o was in *CoCə stems retained not just in Samoyedic, but partly also elsewhere. E.g. *purə- / *porə- ‘to bite’ yields in Permic *puri̮-; *tulə- / *tolə- ‘to come’ yields in Mari *tola-; both more in line with development from *o than *u. An interesting recent discovery, premiered a few weeks ago on Twitter, has also been to note Khanty *lāńć ‘snow’ (> e.g. Surgut ɬ´åńť, Nizyam tɔńś, Obdorsk laś). UEW derives this from a distinct *ľomćɜ, listing here also some derivatives of PS *jom and probably incorrect Kola Sami reflexes meaning ‘frost’. But if we did reconstruct *lomə and not *lumə already in PU, the Khanty words, too, can be simply considered derived reflexes, at the PU level seemingly *lom-ća: *o-a > *ā is regular, and there does not seem to be counterevidence to assuming *mć > *ńć. Closer review might identify more cases like these that support the reconstruction of PU *o in the involved words.

As more of a long shot, there are also two unclear cases where evidence for *o might be found in Indo-European. For one, ‘to bite’ seems comparable with PIE *bʰe/orH-, root meaning probably ‘to strike, pierce’. The PU verb also probably meant specifically ‘bite thru’ (in contrast to *soskə- ‘to chew’), coming fairly close to ‘pierce’. Its descendants can be also used not just of biting with teeth, but also working with tools (cf. e.g. Fi. sahanpuru ‘sawdust’, as if “saw-biting”); similar later development is attested in derivatives on the IE side too (Latin forō, Germanic *burō- ‘to bore, drill’) [5], and LIV goes as far as to give a gloss ‘mit scharfem Werkzeug bearbeiten’. Distribution all the way into Samoyedic makes it difficult to assume loaning, though, while a hypothesis about an old Indo-Uralic cognate would not, at the current state of research, rule out an original *u that was lowered to ablauting *e/o in PIE. For two, there is Finno-Mordvinic *unə ‘sleep’, which Koivulehto (1991) has already compared with Greek ὄναρ, ὄνειρο- and explained exactly thru Janhunen’s Law: early IE *oner → early Uralic *onə > *unə. Whether the Greek word goes back far enough in IE for this to be feasible looks very dubious to me though, especially when there is a much better-attested PIE word for ‘sleep’, *swépnos.

A yet further possibility I would wish to look into in more detail in the future: does the raising of *o that we seem to see really have the “same” *o as its starting point as is usually reconstructed in PU? Namely, traditional PU *o is in Samoyedic by default lowered to *å, such that its “survival” in Janhunen’s Law cases really looks to be also innovative. As outlined in yet another presentation a few years ago, I have also developed a hypothesis that the unbalanced inventory of rounded vowels in Proto-Uralic, *ü *u *o but no **ö, probably comes by a chainshift from pre-PU *u *o *ɔ. (I have not discussed this on the blog in detail so far and, alas, cannot do so right now either.) Then, the common tendency of PU *o to be lowered to *a / *å probably indicates that this chainshift had actually not fully taken place by PU: that “*o” was really still open-mid *ɔ. Janhunen’s Law positions, however, look like they might have already had close-mid *o. This would allow us to do away with a raising that happened all across “Finno-Ugric” with seemingly no motivation, while still also not folding the vowel correspondence entirely into PU *u.
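A chainshift of this kind only keeps the three vowels apart if all of its steps apply at once. Here is a toy Python sketch of that point, assuming the vowels pair up in the order they are listed above (*u > *ü, *o > *u, *ɔ > *o); the function names are invented for the illustration, and the second function is deliberately wrong, to show what goes missing when the steps are chained one after another.

```python
# Assumed pairing, read off the order in which the vowels are listed:
# pre-PU *u *o *ɔ  >  PU *ü *u *o
CHAIN_SHIFT = str.maketrans({"u": "ü", "o": "u", "ɔ": "o"})


def shift_simultaneous(form: str) -> str:
    """Apply all three steps at once, so every contrast is preserved."""
    return form.translate(CHAIN_SHIFT)


def shift_sequential(form: str) -> str:
    """Deliberately naive: feed each step into the next, lowest vowel first.

    Every rounded vowel ends up as ü, i.e. the contrasts are merged,
    which is not what a chainshift does.
    """
    for old, new in (("ɔ", "o"), ("o", "u"), ("u", "ü")):
        form = form.replace(old, new)
    return form


if __name__ == "__main__":
    inventory = "u o ɔ"
    print(shift_simultaneous(inventory))
    print(shift_sequential(inventory))
```

A partly completed shift, as suggested for PU itself, would correspond to dropping one of the entries from the mapping.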

There would be also another option on the relationship of this *o with my pre-PU *u *o *ɔ. Rather than early raised cases of (pre-)PU *ɔ, they might be also straggling non-raised cases of pre-PU *o… And then was this *o really just an allophone of *ɔ either? *u is a very common vowel in PU, and perhaps this is partly because even some further cases should be likewise reconstructed as *o. This might be possible if we identified other evidence for it than retention as *o in Samoyedic. For the sake of example, one case might be Mansi *u: PU *u yields in Proto-Mansi either *u, *ŏ, *ă with no very strong conditioning apparent. (Some similarly open issues remain in Khanty and Hungarian.) So just maybe … could it be that PMs *u is a sign of PU *o as distinct from both *u and *ɔ in general? such that not only will we then reconstruct PU *por- ‘to bite’ (> PMs *pur-), but also e.g. *końćə ‘urine’ (> PMs *kuńćə), with *o > *u now also in Samoyedic in this environment (> PS *kunsə)? This would even have a good parallel among the front vowels: PMs *i is generally from PU (close-)mid *e, not from close *i. — But in the interests of putting these notes finally out at least in a somewhat assembled form, I will leave this line of thought open for now.

[0] See previously at least: Lehtinen’s Law; Moosberg’s Law; and one that definitely requires a name but I’m still mulling over what to call it precisely is *Ä-backing in Finnic. Several future installments remain planned too.
[1] On the contrary, an irregular fronting already in Proto-Western Khanty would also account for most of these reflexes: *tŭɣ > *tü̆ɣ > *tĭɣʷ > *təw, preserved in SKh and giving NKh *tŭw (cf. e.g. ‘fall’: PKh *sü̆ɣəs ~ *sü̆ɣs > SKh səwəs ~ süs, NKh *sŭws or *sūs). But it seems preferable to me to restrict this irregularity to Southern Khanty and treat Konda tŏw and NKh *tŭw as regular reflexes. Maybe there is some possibility that the SKh development here and in ‘bone’ can be explained as *ŭw > *ū > *ǖ > *ü̆w > əw, leveraging the known fronting *ū > *ǖ? It doesn’t look like *ŭw and *ū actually contrast at all, so the first step here might be entirely virtual.
[2] Хелимский, Е. А.: О соответствиях уральских a- и e-основ в тазовском диалекте селькупского языка. – Советскoе финно-угроведение 12: 113–132. No cognates known elsewhere in Samoyedic, but the simplification *wo- > *o- would have to be pre-PS anyway, since by PS a new *wo- does exist and per two examples yields in Selkup *ko- as expected: *woəj > *ko ‘island, hill’; *wotå > *kotə ‘blueberry’.
[3] Though, since PS shows *r > *l / C_ in various suffixes, could it be possible that after *j, the resulting cluster further coalesced to *ľ, and then evolved into just *j as usual? In this case Fi. kuiri and PS *kuj could both go back to PU *kujrə (now with no especial reason to suspect a suffix in there).
[4] For a formal match and semantics within speculation distance, cf. PU *luwə ‘south’ ≈ ‘direction where the weather is warm’?? Seems unlikely but not impossible.
[5] And cf. further PU *pura ‘drill’, also already proposed to be an IE loan. So far it seems morphologically unclear to me how to connect this with either the PU or PIE verbs, though.

Posted in Reconstruction

State of the Blog: Second Decade

Blogging here at Freelance Reconstruction has been slowing down in recent times, as we approach the 10-year anniversary of its WordPress iteration, coming up just at the start of the next year. [1] In 2013–2019 I was writing about 1–2 articles per month; in the 2020s so far, fewer than 10 per year. To be sure, some of life’s external issues and circumstances have also been getting in the way, starting already with the obvious: CoViD-19 and issues downstream of it. But this also coincides with me finally being now at the rank of a graduate student, and being not just welcomed but expected (as of this year, by actual funders even [2]) to now put out my ideas as proper peer-reviewed publications. There is a whole bunch of work to do on this. Or indeed re-do: it feels like every article draft I sketch out ends up with at least one footnote to the effect “for earlier discussion of this, see Pystynen 2014 [blogpost]”.

Another turning point approaches too: where this blog will, at last, have more published than unpublished posts, both being at ca. 160. This may give a hint to what extent I have also quite a lot of unpublished research, most again formulated back in the mid-2010s, still stewing in my blog drafts. This is a situation that definitely calls for skipping over a step in the publication pipeline and refactoring this corpus, too, into other forms, now that I am able to do so. And this also does mean far fewer blog posts coming out as intended.

Even a third venue to air my ideas is by now moreover the Finnic Etymological Wiki Database, which I have been setting up over these same few years, under the folds of our project on writing a new etymological dictionary of Finnish (which uh, I don’t think I’ve ever announced here in detail; partly since it’s being written in Finnish). The platform is intended not just as a data backend for the dictionary, but also for discussion among scholars, e.g. for proposing new etymological ideas that do not seem quite ready for publication just yet. I’m by now doing this with some frequency, instead of spending more work on turning them into etymology squibs here (sample: is Mordvinic čakš ~ šakš ‘pot’ not a cognate of Old Finnish haaksi ‘ship’, but maybe a derivative from čava ‘plate’, if from earlier *šaɣa?). — Any colleagues interested in this, and with serious familiarity with Finnic etymology at least, are also welcome to request an account from me or the rest of the moderation team for contributing to the discussion.

By no means do I wish to abandon blogging altogether. But I may aim to shift away from the more effort-demanding blogposts to the effect of a mini-research article, at least as long as blogposts continue to be neglected by the powers-that-be as a recognized type of research output. Perhaps I will focus more here on reviewing issues, or bringing up points already made about them in the literature, than on presenting major syntheses on what to do with them. It remains to be seen how this will work out. But you can probably at least expect to see the next few Uralic reconstruction posts appearing here to be rather in this paradigm. Of course posting of other matters, e.g. on the state, context, philosophy and methodology of historical linguistics, is likely also going to be continuing on to the next decade. And maybe I will yet get around to re-hauling the site’s appearance or organization, as already hinted in 2019.

Thanks to all readers and commenters, and see you in the rest of the 2020s!

[1] The decennary of my linguistics blogging altogether has already slipped by about a month ago…
[2] I would also like to take an opportunity here to issue my thanks to Ante Aikio and Martin Kümmel for letters of recommendation to go with my funding pitch.

Posted in News

Long-Distance Comparisons As Butterflies

One of the rationality-cluster blogs here on WordPress, Aceso Under Glass, a while ago posted about a concept I find immediately useful: “Butterfly Ideas”. Roughly speaking, these are hypotheses that need further development, are probably not ripe for serious criticism as they stand, but could benefit from preliminary discussion (read the full post for more).

On this blog and elsewhere, I have repeatedly entertained a variety of “long-distance” linguistic relationships: Nostratic, Uralo-Yukaghir, Uralo-Eskimo, the works, despite not being so far highly committed to any of them. One idiom I’ve previously used to defend this is “big fish are worth angling even if you don’t catch any”; that there are major potential gains for our understanding of history (both intra-linguistic and extra-linguistic) if any of these theories start to prove themselves in more detail. Or as the more succinct modern spin goes, “big if true”. A second motivation is provided by what I have called the “cell theory of language”: spoken natural languages only come from other natural languages, never out of nothing. [1] This gives a strong prior that all natural languages are, indeed, related, even if we currently lack the knowledge of the details. Factoring in anthropology further gives strong reasons to believe also in the existence of a number of “bottleneck proto-languages”, such as Proto-Australian, Proto-Amerind or Proto-Exo-African. So big fish are very likely indeed out there, even if we are not sure if our lures are working. Though then these are weaker boundary conditions that do not establish what currently-known families exactly would be the daughters of such a proto-language. E.g. who knows if some American languages might be not Amerind ≈ Beringian, but something else, like para-Na-Dene, pre-Clovis-coastal, Solutrean…? Continuing the metaphor, this would mean we don’t even know how big the fish are exactly, and so also we might not know (yet?) what are the best ways to catch them.

But there’s also a sense in which I think long-distance relationships would be better seen as butterflies than big fish. We do not find relationships in an instant, as sudden flashy discoveries (by “bites” on a “lure”). All spoken languages are in principle comparable, with known typological differences but also universal family resemblance. [2] The universality of basic phonological categories in particular makes it possible to find some resemblances between any two languages that plausibly could be indicative of some etymological or indeed genealogical relationship. Whether they actually are, depends on additional work on fine-tuning details. Are they above the level of pure chance, and independent of known onomatopoetic and nursery word trends? Are they in conflict with other data of equal value? Do they show recurring sound correspondences, at least some of them nontrivial? These are questions for which we cannot expect to have every answer in place immediately. Any relationship must always begin from observing some similarities that are not probative in themselves, and then pursuing this as a hypothesis and seeing if it guides us to more similarities, ones that will not require further costly assumptions to justify.

If all we knew about Finnish and Hungarian were that their verbs for ‘to live’ are, respectively, elää and él, this would not be sufficient evidence to establish them as related languages. But they are, indeed, cognates. Insufficiency or statistical insignificance does not in any way refute cognacy per se. And it is true that checking for more examples of the correspondences e ~ é and l ~ l turns up more evidence such as pelätä ~ fél ‘to fear’. Now with a new correspondence p ~ f, but this does not mean we turn up our nose and declare the hypothesis unworkable: it’s possible to continue and maybe discover, say, pesä ~ fészek ‘nest’. It always takes several steps like this to assemble e.g. a phonological core that will be self-evidently non-accidental. Same for other “evidential cores”, such as partial common morphological paradigms. There is no immediate bite that instantly proves a relationship, but rather, a first weak signal, which will rise in importance once combined with a proper selection of other datapoints.
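The bookkeeping behind this kind of step-by-step accumulation is simple enough to sketch in Python. What follows is a minimal illustration, not a real alignment method: the segment pairs are aligned by hand and cover only the correspondences named above (e ~ é, l ~ l, p ~ f), and the names ALIGNED, tally and recurring, as well as the threshold of two attestations, are assumptions made for the example.

```python
from collections import Counter

# Hand-aligned (Finnish, Hungarian) segment pairs for each word pair.
# Only the segments discussed in the text are included; the rest of
# each word is left unaligned on purpose.
ALIGNED = {
    ("elää", "él"): [("e", "é"), ("l", "l")],
    ("pelätä", "fél"): [("p", "f"), ("e", "é"), ("l", "l")],
    ("pesä", "fészek"): [("p", "f"), ("e", "é")],
}


def tally(aligned):
    """Count how many word pairs attest each segment correspondence."""
    counts = Counter()
    for pairs in aligned.values():
        counts.update(pairs)
    return counts


def recurring(counts, minimum=2):
    """Keep only correspondences attested at least `minimum` times."""
    return {pair: n for pair, n in counts.items() if n >= minimum}


if __name__ == "__main__":
    counts = tally(ALIGNED)
    for (fi, hu), n in sorted(counts.items(), key=lambda item: -item[1]):
        print(f"{fi} ~ {hu}: {n}")
```

With only the first word pair in the table nothing recurs at all; each added pair raises the counts of the correspondences it shares with the earlier ones, which is the weak signal gaining weight.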

Any “minimum convincing argument” will not be dozens of steps deep necessarily, but where patience is especially needed is that at any stage there will be plenty of false paths of expansion that will not lead to a workable theory. If at some early point, we had formed a hypothesis of a ~ a, and then run into vapaa ~ szabad ‘free’ (without realizing that both are loanwords from Slavic) — we could still find more evidence also for p ~ b (e.g. by misanalyzing the correspondence Fi. mp ~ Hu. b), but no additional good evidence would be turning up for v ~ sz. At some point we might end up concluding that, yes, this is going nowhere and should be discarded. But then only this comparison! Finnish and Hungarian are still ultimately related, even if their words for ‘free’ are not cognate. Discarding this one comparison does not (should not) mean discarding also any other adjacent comparisons. A burgeoning comparative edifice needs to be open for exploration and individual mistakes, if it is to ever reach any particular rank like “a probable relationship” or “a proven relationship”.

This plea of course has also a corresponding inverse. Anyone who wants a “butterfly” treatment of their ideas has to have enough intellectual humility to recognize that it is, indeed, a tentative first-pass version. All too often I see also people who have a new language relation hypothesis in hand double down on their speculation, and not be open to even constructive criticism. Perhaps in some part there is a misunderstanding where people do not recognize the proposal of better, non-cognate etymologies (borrowing, onomatopoeia, internal derivation) as progress. But certainly also lone-wolf-genius-ism, and its attached incapacity to admit mistakes, is a problem that exists.

On the other hand, I don’t think this side of the problem needs to be focused on too much. In historical linguistics, the exploration of linguistic relationships is already a known research programme, a goal that many people agree to pursue even if we tend to disagree on quite a lot of details. This in mind, if a K. Kookenstein puts out a paper on allegedly showing how English is related to Arabic, but then refuses to consider these comparisons in light of what Indo-European or Semitic linguistics has to say on this: we don’t actually need his approval on this! Language data is not locked, copyrighted, or in any other way tied down to one person, and if desired, it will be possible in any case to check such papers for insights relevant also to better situated IE–Semitic comparison. I know I at least keep a few “Hungarian is too a Turkic language” type works around for this purpose. The intended main thesis is not going to pan out; but any data cited to this end could prove to be regardless still valid. Usually anything of this sort mostly relies on word comparisons (appeals to typology are strangely rare), and these might remain valid as etymologies of any imaginable type… not just Turkic loans in Hungarian, but maybe also old Hu. loans in Tk.; Hu. cognates of Khanty or Samoyedic loans in Tk.; common loans from some third source like Iranian or Yeniseian or Mongolic; some could even end up being evidence for a general Turkic–Uralic relationship. None of this is a priori ruled out, and in this way it may well be possible, with patience, to find meaningful building blocks even within theories that don’t hold up in their entirety. Such is a nifty property of historical linguistics, something that definitely doesn’t generalize to every science.

The two animal metaphors from the start of this post, though, no longer work very well at this point. Some butterflies … may grow up to be big fish, even though most probably don’t? Moreover, I have been mostly illustrating this discussion with disputed-but-definitely-published ideas. More nascent ideas that are simply brought up in a discussion are a different beast for sure. Of course there’s a selection bias here too: the actual butterfly ideas I do have, you will probably not be seeing on this blog as such (and you might have to watch closely to catch any even on my side channels). [3] Arguably also scientific publishing is “a conversation”… especially any ideas that can be so far found only in some paper draft posted for comments online (in linguistics they’re not even concentrated yet on any arXiv analogue). For these, the original reading of a butterfly idea seems to still work fairly well. This may hopefully help (e.g.) various long-distance proposals to develop better in the end, before they end up with one of two common fates: shelved as not having passed the judgement of Reviewer #2, or self-published with excessive confidence. For this goal, yes, the ball very much is first in the court of people who do have an idea and want to develop it; but it is also in the hands of the rest of us, in being willing to offer first criticism that’s not a complete dismissal. Thirdly, worth noting, all this also depends on a social milieu where people even can find parties interested in discussing some out-there idea.

A further aspect of AUG’s original concept — avoiding unnecessary emotional stress upon people presenting a new idea — I haven’t really even touched here yet. This would be a whole other jar of larvae, but suffice to say I agree that academic discussion, for all its standards of civility, fairly often can have undertones all the way to hostility. This probably scares away many people without a thick skin who might otherwise have had a few interesting things to say; and those of us who do stay engaged, to whatever degree, it may leave with more stress than is necessary.

Some of it, I’m sure, does not even come from a particular need to be prickly, but from limited time… Sufficiently well-known figures in a field tend to get approached by a disproportionate amount of amateurs with A Revolutionary Discovery, unless they specifically keep themselves hard-to-contact, or, perhaps, maintain an aura of not suffering fools gladly. Again a problem that might be softened with other people being open and approachable enough. But this also starts edging towards the general area of science communication and public relations, a bigger fish still to fry that I’m not going to pretend to already have big original ideas for right now (and the butterflies, they will have to wait for other channels).

[1] The famous case of Nicaraguan Sign Language does not seem to have spoken analogues. In principle there is little directly preventing such a case (and something of the sort, maybe in several gradual episodes, will have to be assumed as the ultimate origin of human language too), but the conditions are unlikely to ever come about. A community of children who are capable of speech but do not have access to any pre-existing spoken language? Sorry, language in general is too adaptive to have been ever abandoned after its first introduction. I will go as far as to suggest that all known human cultures depend strongly enough on language for the transmission of cultural knowledge that any sudden failure of language skills across an entire human group (say, a transmissible disease that induces deafness, fast enough that a signed language does not have time to develop) would not lead to an all-new language being developed a few generations later; it would lead to the group’s extinction.
[2] In the philosophical sense, not the genealogical one. E.g. despite some exceptions, most languages still have nasal or labial or velar consonants; all but the most impoverished and unbalanced phonological inventories or even just consonant inventories are going to have substantial overlap between them. And even if we did find languages that somehow have completely disjoint phoneme inventories (lazy example: one has only stop consonants and front vowels, the other only continuant consonants and back vowels?), they will not be unbridgeably far apart: the known typology of sound change allows hypotheses relating basically any two speech sounds. Grammatical categories, too, can be quite different but still only finitely far apart, where the details of known language histories likewise give us ways to relate non-identical categories to each other (or to derive them de novo language-internally, etc.)
[3] A freebie for the sake of example though: cf. some very loose thoughts about the subclassification of Oceanic as floated on Tumblr just a few days ago (also already with some, though not highly severe, critique from a regular correspondent over there).

Posted in Methodology

Language Family Tectonics

Basic research in historical linguistics is mostly done within individual families: we take a swath of attested (in most cases modern) languages, and work towards the past to figure out their development from a common origin, one group at a time. Any knowledge of languages outside the family only really factors in as correction terms: filtering out loanwords and other contact influence, as data that the family’s overall internal history will not need to account for.

What the big picture of this looks like once we consider also geography is that we end up with a series of dots — “homelands” (though not to be understood as points of creation, but simply the last uncoverable phase of earlier processes) — somewhere in the past; some of which have then expanded, to cover the whole world by today. Just a few millennia ago, much of the world would have been an uncharted area, full of regions from which no knowledge of their languages has survived to us. The ones that do survive would, even, have been largely isolated dots. Most language contacts must eventually end (or rather, begin) at some point in the past. Languages of different families, that are today next to each other, cannot all have had their parents too as neighbors. Perhaps some individual cases were: Proto-Germanic seems to have been about as much of a neighbor of Proto-Finnic as Swedish and Finnish are still today; even further back, something like Proto-Kartvelian as a neighbor of Proto-Northwest Caucasian could be possible too. But once we consider highly expansive families, it is self-evidently absurd to propose that Proto-Indo-European could have been simultaneously a neighbor to all of (pre-)Proto-Kartvelian in the Caucasus, (pre-)Proto-Uralic in the taiga zone, (pre-)Proto-Dravidian in South Asia, pre-Basque in Iberia…

This already implies that most borders of today’s language families are collision zones: where two lineages have come to meet that were not in contact at some point in the past. (Same also for some, though fewer, language borders within them.) I’d like to think that we can probably divide them further in subtypes. This will have to include their history, not just their current but also past dynamics. One reasonable analogy might be plate tectonics. Geologists are not content to simply locate the current boundaries of the world’s tectonic plates, but ever since the rise of continental drift to a mainstream theory, already introductory maps will also aim to identify boundaries as either constructive, destructive or conservative. Often longer-term history or future, too, could be extrapolated from arrows of movement (of, yes, actual movement right now — as per the classic example and the mid-ocean ridge closest to me, the Atlantic Ocean is growing some three micrometers wider every hour, already a perfectly visible amount of maybe 0.3 millimeters since I began to write this blog post).

Of course this is not to be aped too closely. The social “forces” that drive linguistic expansions can be rather fickle, nowhere near as stable and predictable as the physical forces of geology in e.g. continental drift. No responsible linguist is going to be putting a predicted specific time of death on any but, perhaps, an already moribund language (those where all transmission to new generations has already ceased, and the only question is whether the last few speakers have 5 or 50 years left to live); and predictions on what languages will be gaining new ground entirely I have not really seen anywhere at all. If anyone wants to register particular predictions, be my guest, but currently these are really only going to be educated guesses, not derived from a theory with known predictive power.

So maybe let’s not draw any future-pointing arrows on linguistic fault zones just yet. Drawing past-originating ones, though, seems like a much more doable task, first of all in cases where (some) history is already known. And this I think also gives us anyway some analogues of geologists’ “constructive, destructive, conservative”. A look at known history actually suggests that just two types might be enough to get started. Of course we can have conservative boundaries, where languages have stayed each on their own side for a while. This often coincides with also geographic boundaries of some sort (e.g. the northern boundary of Indic has been, broadly, at the Himalaya for millennia, and it’s no wonder that the Korean / Japonic boundary has stabilized between the Korean peninsula and the Japanese archipelago). Then we have collision zones, where two lineages come head to head —

But wait. Head to head? No, actually, the most typical case we see anywhere in the world’s known history is not quite this. Where we find e.g. a Germanic / Celtic boundary in the British Isles, a Finnic / Samic boundary in northern Finland, a Turkic / Iranic boundary north of Iran, a Bantu / Khoe boundary in Botswana: these do not represent cases of two spread events that finally arrived at some common ground simultaneously, running out of no speaker’s land to claim. Almost always such a border represents one newer (Germanic, Finnic, Turkic, Bantu) and one older family (Celtic, Samic, Iranic, Khoe), with the latter’s historical range extending far into the former’s current-day one. The geological analogy happens to continue working here too to some extent: when two plates collide, for all the mountains that result, these still are not zones where both plates indefinitely squish and crumple without crossing. Instead one plate will be pushed underneath another, into the crust (and mainly the topmost one will jut up as mountains). Now the distribution of language families does not really have a Z-axis, but the time axis does similar duty here. We already routinely speak of e.g. English expanding (having expanded) “over” Brittonic; and call the latter a “substrate”, the former a “superstrate”, again employing terms from geology that strictly speaking refer to vertical location. I’m sure also a part of the motivation is one of geology’s core findings that, by default, vertical order reflects historical order!

To fully derive an understanding of this situation, the naive zeroth-order model of language family expansion (they start in some compact area in the past and begin expanding) moreover needs to be amended by the fact that expansions are not infinitely powerful: they can run out of steam even without encountering another expansion in their path. Not only does Finnish supersede various lost Sami varieties, it is also not the case that Samic started somewhere in the north and expanded south until running into Finnic. Rather, Samic also itself originally expanded mainly northwards, probably much along the same geographic routes. There was no southward expansion front of Samic for Finnic to collide with; nor an eastward expansion of Celtic by the time of the Germanic expansions, etc. In this way linguistic expansions might have a better geological analogy still in lava flows in a volcanic field: they will layer on top of one another, not by virtue of which one expands faster or more strongly, but by simple virtue of which one has already stopped, at least in a particular area, and which one is still going.

In those cases where two expansions do happen to be going on simultaneously, this is maybe indeed more likely to end up with something resembling a conservative boundary. And also among these, many though will prove not quite entirely stable if we look closely enough. They can turn out to be series of small advances on either side, just not spilling out to outright conquest of the other family (and likewise, mostly not inherently one-dimensional lines anyway, but a crossfade in the proportion of speakers of X versus Y). Again more like lava flows than continents.

Still, I will continue to keep the term “tectonics” here anyway. Etymologically looking, it is not a term that by itself implies the details of plate tectonics, but simply refers to the largest-scale analyzable units.


What can we do with this then? If we recognize that the world’s major language family boundaries are mostly collision zones — where one family is or has been in the process of expanding at the cost of another, not currently expanding one — this gives us first of all convenient rules of thumb about linguistic substrates. Anywhere near a language family boundary, the substrate of an expanding family X is probably primarily the non-expanding language family Y next to it. At least in the wide definition of “substrate”, that is “the language spoken there before the expansion of the current family”. Whether it has left any discernible substrate influence, structural or lexical or toponymic, would be another discussion entirely. Conversely, locations where we might be able to fruitfully hypothesize completely extinct substrates will be instead

  1. more towards the geographic or expansion centers of recently expansive families (thus e.g. the Paleoeuropean substrates of Germanic);
  2. underlying not-most-recently expansive families that have few or no leading edges over anything anymore (thus e.g. the Paleolaplandic substrate in Samic).

Or further yet. The facts that language families expand from small origins, readily take over other languages in the process, and are also generally just some thousands of years old, lead us also to a more powerful rule of thumb: There Was Some Other Language There Before. Almost no language is the absolute first language to have been spoken in “its” territory. The main exceptions would be a few cases of recent seafarers, above all in Polynesia; several more scattered cases also in the Atlantic, of which I think only Icelandic and Cape Verde Creole have been established as their own languages. [1] At any other ends of the Earth, Inuit is a known newcomer in the American high arctic, Pama-Nyungan is a known newcomer in the Australian interior desert (even if the languages preceding them are not attested)… and in places with long written history, we may find quite extensive known successions, to the effect of Hattic replaced by Hittite replaced by Luwian replaced by Aramaic replaced by Greek replaced by Arabic replaced by Turkish. Maybe some Assyrian or Kurdish phase in there somewhere too, depending on what point we’re considering here exactly. More importantly, over the remaining at least 60,000 years of modern human presence in West Asia without written records, obviously much much more of this still. Not all of this leaves major genetic or archeological fingerprints, either, and some specific cases might be very hard to identify if we didn’t have linguistics itself as a source of evidence.

For two, it will be generally beneficial to work out which of any two language families in contact at a particular border has been the more recently expansive one. [2] Or to make this known more widely, at least. I’m not sure if there actually are many cases where this would be a mystery entirely. I could think of some hard-to-tell cases once we’re talking about subfamily borders (Mari / Udmurt? Celtic / pre-Latin Italic?), but even here probably some dedicated experts would have an opinion. Maps of individual language families, especially in historical contexts, often enough also have some spread lines or historical distributions marked. But large-scale summary maps still trend towards presentations like this, seemingly entirely static, even though the process of restricting language families to complementary areas necessarily elides some current-day detail in favor of historical idealization (denoting where a language family “is native” or “is traditionally spoken”). I’ve seen sociolinguists criticize this whole genre of language distribution maps repeatedly already, for not really capturing synchronic reality. The response though might not need to be to abandon them entirely, as much as admit that, yes, they are maps that display some historical information too, and adjust accordingly for more history-informed design. If there is knowledge on this mostly out there, why not?

For three, a concept of family tectonics readily draws attention to the point that there’s work to be done not just on charting language families’ “current” or “traditional” distribution, but also their past distribution. “Beneath” (before) any current language family there “is” (was) some different distribution of other languages. Some of them maybe belonging in its still extant neighboring families, some maybe its own lost relatives, some maybe unknown entirely.

The first possibility I find the most interesting for the sake of further work. The closest example to my work comes from central and eastern Siberia. An important but I think largely open question would be what was spoken in the area before the expansion of the relative newcomers? Russian is of course the newest layer all over the place, but Siberian Turkic (Yakut, Tuvan, etc.) and Northern Tungusic (Evenki, Even, etc.) are both parts of relatively recent families too. What have they ended up displacing? Early Russian explorers report, and rudimentarily attest to, first of all a formerly wider distribution of the Yukaghir family, today known only in two small islets; and a variety of Samoyedic and Yeniseic varieties in the southwest of this area. Still, the main Turkic and Tungusic expansions must have been early enough to predate all historical records in the region, so this cannot be the whole picture either. One hypothesis I keep coming back to is the possibility of a lost “tenth” Uralic branch — perhaps para-Samoyedic, perhaps an independent branch entirely. This might have some benefits to it in explaining a variety of known but not especially substantial similarities between Uralic and all the other families further east. Turkic of course has been in direct contact with (branches of) Uralic anyway, but various parallels continue sporadically into Yukaghir, Tungusic, Chukotkan, Nivkh, Eskaleut. All of them seem more likely to originate from the Uralic side, due to it being the Siberian family with the most known time-depth. Yeniseian is sometimes approximated as rather old as well, but otherwise both “Neosiberian” and “Paleosiberian” are all families without too much time-depth. [3]

Most notably, Uralic parallels in eastern Siberia include even basic words for ‘reindeer’, an all-important livelihood animal for many groups these days, especially Chukotkan *qora (whence the ethnonym Koryak), Tungusic ⁽*⁾oron (or probably *xoron, with further diffusion after *x > ∅ in NTg) (whence the ethnonym Oroqen). Kolyma Yukaghir qoroj ‘two-year-old male reindeer’ is usually adduced here too, as well as loanwords further into Siberian Yupik. This has been already identified in earlier research as a Wanderwort originating in Proto-Uralic *kojəra ‘male [domestic?] animal’ > Proto-Samoyedic *korå ‘id.; bull reindeer’, which might have already had an allophonic [q-] in Proto-Samoyedic or even earlier. But we seem to lack especially clear evidence on who is to be credited for the original diffusion of this word. Yakut, as far as I know, has no reflex of it, splitting the Eastern Siberian region off from Samoyedic, and thus probably suggesting a pre-Turkic movement eastward. If so, then maybe even already at the time of the original Uralic expansion (which I think must have been partly eastwards too in any case)? Who knows. Maybe someone will eventually though, if we get e.g. some additional toponym data for guidance and keep inter-family comparative research going.

Elsewhere in the world, I’m wondering also about e.g. how far Africa’s other language families might have reached before the Niger-Congo and particularly Bantu expansion. The case of possible contact between Khoe and Cushitic is already preliminarily discussed in a 2009 paper from Blench, though I’ve been unable to verify his interesting claim that Khoe #goe for ‘cow’ would be comparable with similar “widespread terms” in Cushitic. [4] The quite tattered Central Sudanic looks like another good candidate for a family that might have been more widespread earlier (but might have been also encroached upon by Chadic and the various branches of Eastern Sudanic). In the Americas, too, I could wonder especially what preceded the large continuous spreads of Athabaskan and Algonquian in most of Canada and the northern US? (And also which of them is the newer one?) Was there ever anything to the effect of “Inland Tsimshianic” or “Inland Tlingit”, “Plains Iroquoian” or “Forest Caddoan”? Or turning to Oceania: how far west and east did the various “”Papuan”” language families (many of them even today not confined to just New Guinea) extend before the Austronesian / Malayo-Polynesian expansion? For that matter has anyone even tried comparing any of these with the other continental SEA languages in any capacity, or just assumed that they must have been in splendid isolation amongst each other linguistically effectively forever?

These are questions that, again, some experts might already know answers to or at least have hypotheses for. But nowhere is this information available in centralized geographic form, even though it would be surely possible to represent so, giving a kind of a bird’s eye view of what are the major ethnohistorical results achieved or confirmed by historical linguistics, and what questions still remain open.

[1] Faroe Islands seem to be better established than Iceland as having had a pre-Norse population (at least as of the Nature study just last December). A longer list of cases without a distinct local ethnicity includes e.g. the Azores, Bermuda, Falkland Islands, Svalbard, Tristan da Cunha (and also remote islands in the other oceans, e.g. Kerguelen). There are some more within-reach cases like the Andamans, Maldives or Nicobars, for which I’m not sure what’s known of their prehistory (though then already the existence of two Andamanese language families suggests that one of them is very likely older than the other).
[2] Not always the same family on top in all interactions: Turkic has been expansive over Iranic, while Russian has been expansive over Turkic … and yet Russian and Iranic are both Indo-European. It should be no surprise at all either when we find e.g. language shift from Swedish into Finnish in Finland, vs. from Finnish into Swedish in Sweden.
[3] Really if “Neosiberian” is taken to mean “the recent but pre-Russian arrivals”, and “Paleosiberian” as everything else in the area — then we ought to be counting Uralic as the largest representative of the latter, not as some European family that somehow just happens to be also present. By now we do know the westernmost expansions of Finnic, Samic and especially Hungarian to be relatively recent, while Uralic or pre-Uralic presence in western Siberia has no established terminus post quem (short of the hard geological limit of the last ice age). — I suppose the usual exclusion of Uralic from “Paleosiberian” has been instead more informed by its typological similarity with Turkic and Tungusic. But then this seems improper when the term is Paleosiberian, not “Non-vowel-harmonic-siberian” or anything else of that sort.
[4] Checking with a recent monograph from Bender instead shows some very incomparable-looking terms in most of Cushitic, such as Oromo /saʔa/, Konso /lawaa/, Agaw (North Cushitic) *lɨw-, South Cushitic *ɬee; or does Blench have some supposition about a Northeast Caucasian-esque *ɬ > *g?! — Further north, *gʷow- ‘cow’ in Indo-European does look amusingly similar to Khoe, but Afrasian is a bit too wide and old of a family (definitely older than the domestication of cattle, which “only” dates to ~10,000 years BP) for me to think that there could be a connection entirely without it. Even something like the mysterious Y-DNA haplogroup R-V88, common in central Africa around Lake Chad yet seemingly derived from Eurasia, doesn’t really allow any connection that would reach all the way to southern Africa.

Posted in Methodology

Reviewing UraLex

Nerdsnipe of the day: the BEDLAN team, researching diversification of the Uralic languages interdisciplinarily, mentioned earlier today that they will be soon uploading version 3 of their UraLex dataset of basic vocabulary across Uralic. I thought this might be a good time to do a look-over of the data, from a not-that-computational historical linguist’s point of view (i.e. mostly on the contents, not the technical details). Maybe these comments will be helpful either to the team or to other people aiming at similar projects.

Data sources

The selection / definition of languages looks mostly good already to me, with varieties being specified fairly closely, including details like “Sosva Mansi” rather than just “Northern Mansi”. Unmarked “Selkup” is however questionable at least. This is claimed in the documentation to be more specifically Taz Northern Selkup, the currently most vital dialect [1] and the basis of current written Selkup. The listed forms, though, often look more like the Proto-Selkup reconstructions from Sölkupisches Wörterbuch, e.g. in retaining PSk *č (> modern NSk /t/) and *uə (> *Cʷë > modern NSk /Cɤ/, /wɤ/). A similar issue is the database’s “Karelian Proper”. This too does not appear to be any real variety of Karelian, but rather the interdialectal lemma forms of Karjalan kielen sanakirja, which are frankly overly Finnishized (not really actual Proto-Karelian), and elide many important contrasts, especially voiced obstruents and, mostly, the s / š contrast. E.g. rasva for ‘fat’ only appears as such in the Oulanka dialect. Most northern Karelian has rašva, much of southern Karelian razva, some intermediate southern dialects ražva.

The KKS and SkWb lemmas are probably tolerable as lexicostatistic indices to Karelian and Selkup, but I hope some future update might fix this in favor of actually-recorded language varieties — and certainly before anyone tries to do phonological analysis with this data!

I would have some desiderata myself on what varieties’ classification would be interesting to gauge by their lexicon. Foremost maybe transitional varieties, such as Karelian Isthmus Finnish; NE Erzya and Shoksha; Pelym, Lozva & Eastern Mansi; Berezovo, Nizyam, Salym & Vartovskoe Khanty; anything really among the Selkup dialects. But it’s possible that this is too fine detail for a Uralic-wide dataset and would call for within-language-group studies instead, similar to Rydving (2013) on Sami. And it appears that the most important additions for within-Uralic study have already been planned: adding Moksha besides the currently represented Erzya; Hill Mari besides Meadow Mari; Obdorsk (Northern) Khanty and Pelym (Western) Mansi varieties besides EKh and NMs; Kamassian and Mator within Samoyedic. These should cover many bases. E.g. the well-known Mansi cognate(s) of Hung. tűz, EKh tö̆ɣət ‘fire’ are not recorded from NMs, but do appear in WMs (Pelym toåwt, Upper Lozva töät, North Vagilsk tüöwt, etc.)

A different point entirely is that attempts to study specifically the interrelationships of the nine basic Uralic branches would, I think, function the best if using their protolanguages as the basic data points. There are also a few gotcha cases where no coverage of modern-day languages is sufficient: occasional native Uralic terms might be reconstructible for Proto-Mansi only from early 19th century wordlists, for Proto-Samoyedic only from Castrén’s mid-19th century records, for Proto-Mordvinic only from Witsen’s 18th century records, for Proto-Hungarian only from early medieval records, etc. Comparative-historical Uralistics is maybe not particularly philology-centered, but has never been able to afford overlooking philology entirely. [2]

The selection of semantic concepts to cover is generally reasonable, pulled from major basic vocabulary lists like various Swadesh lists and the Leipzig-Jakarta list. Some of the items on these do break up completely to noise within Uralic, but that’s a good point to have on record as well. I do not think the classic Swadesh list was assembled very rigorously, and at some point it would be good to know not just something about the relative average stability of concepts on it, but also their variance in stability across different language families. An example I have often mentioned in discussions related to this is how in Uralic, ‘fish’ and ‘moon’ are highly stable, while ‘cow’ is unreconstructible and ‘sun’ is highly unstable; while in Indo-European, ‘cow’ and ‘sun’ are highly stable, vs. ‘fish’ unstable and ‘moon’ just about unreconstructible. (This phenomenon e.g. already constitutes a fairly strong critique of glottochronology or any models resembling it, which would rather predict average variance to be a monotonic function of average stability.) — Many of the more unstable and entirely unreconstructible concepts seem to be from the LJ list. This is basically what we should expect I think, since these have been selected only by their stability vs. loaning, not vs. all the other lexical innovation processes out there like derivation, semantic shifts, onomatopoeia, a priori coinages (and also not even vs. the likelihood of synchronic synonymy).

There are regardless still many semantic concepts or etymological groups that I think would have a bunch to say about the diversification of Uralic, but which haven’t made the mark. These are I suspect typically more Uralic-specific, and they could not be easily located by general cross-linguistic considerations. Simple examples include e.g. terms for local fauna (*śixələ ‘hedgehog’, *onča ‘nelma, Stenodus’), flora (*ďëmə ‘bird cherry’, *pečä ‘pine’) and technology (*joŋsə ‘bow’, *ńëlə ‘arrow’). More involved examples tend towards etyma that Helimski (2001) has called core vocabulary as distinct from basic vocabulary: often verb roots, relational terms, or incipiently grammaticalizing body part terms, that may not have strong semantic stability but do have decent etymological stability. In Uralic thus e.g. *kixə- ‘to rut, lek, be excited, lustful, want’, *kulə- ‘to go out, run out, wear, end’; *pučkə ‘hollow, tube, inside, marrow’; *pončə ‘tail, hem, back part’ (glosses not meant as PU but indicating the range of variation in reflexes). Most regular lexicostatistic methods run poorly however if matched against etyma that don’t have stable or well-defined proto-meanings, e.g. we can’t really ask what is “the” replacement of such an item in a language that has lost it. Down the line, some new techniques entirely will be required for making use of this kind of data instead.

Phonetics & Phonology

I do not know what use, if any, is planned for this part of the data, but especially inconsistent IPA transcription seems to remain a major problem, as many other times in Uralic studies.

  • v is transcribed as a fricative /v/ rather than the approximant /ʋ/ for Estonian, Votic and Ingrian (though correct in Finnish).
  • A phenomenon I’ve seen in many online sources over the last ~10 years, Finnish h is given superfluous and partly incorrect transcription as /ç/, /x/ in many clusters and /ɦ/ in many medial positions. E.g. karhea ‘rough’ as “/karçe̞a/”, though fricative allophones only appear with any systematicity in the syllable coda. Even then these have enough variability that I would think leaving this as phonological /h/ would be surely the safest choice.
  • Some Finnish falling diphthongs are transcribed with glides as the 2nd component (aurinko ‘sun’ /ɑwriŋko̞/, koira ‘dog’ /ko̞jrɑ/), others with close vowels (jauhaa ‘crush’ /jɑuɦɑː/, oikea ‘right’ /o̞ike̞ɑ/).
  • Estonian length marking is a mess. -p- -t- -k- appear seemingly at random as both /p t k/ (thus also -b- -d- -g-) or /pː tː kː/ (thus also -pp- -tt- -kk-); sometimes even in the same word, e.g. lükata ‘to push’ as “/lykɑtːɑ/” (as if ˣlügatta ?)! I don’t have strong opinions on if it’s more proper to use /pˑ tˑ kˑ/ for transcribing grade 2, or maybe /pːː tːː kːː/ for grade 3, but please at least make the distinction. — I’m not even going to start on long/short clusters or overlong vowels, which are maybe less phonologically relevant anyway.
  • Estonian palatalization has also gone absent, e.g. lill ‘flower’ as /lilː/ and not /lilʲː/. Also, four slip-ups of õ turning up as IPA /ɣ/ rather than /ɤ/: “/hɣːrutɑ/” ‘rub’, “/kɣvɑ/” ‘loud’ (but correct in /kɤvɑ/ ‘hard’!), “/lɣkːs/” ‘trap’, “/mɣmisetɑ/” ‘mumble’.
  • Votic transcription includes some allophones like [d̥ g̊ vʲ ɑˑ], but leaves unmarked maybe the most prominent allophone in the language, л = [ɫ], “dark L”. I did not catch any ˣ/ɣ/ pro /ɤ/ mistakes.
  • I’m happy to see that most languages’ palato-alveolar ľ, ń, ś etc. have been transcribed as /ʎ/, /ɲ/, /ɕ/ etc. rather than the incorrect /lʲ/, /nʲ/, /sʲ/ seen in many naive attempts to IPA-fy Finno-Ugric transcription; … but this has been overdone to include also Erzya, for which palatalized alveolars are correct. Not a major issue ultimately, but still an inconsistency.
  • Meadow Mari ə̑ has been transcribed as /ə̱/, which is a bit superfluous; /ə/ would be sufficient. (It is rather Hill Mari ə (= reduced e) that would call for a diacritic in IPA, probably /ĕ/ or /ə̟/.) — The Ob-Ugric data has had the ə / ə̑ distinction phonologized away entirely, though if desired, it could be maintained phonetically at least in Eastern Khanty.
  • Komi and Udmurt: FUT / literary ‹ы› is given as /ɯ/, rather than the more correct /ɨ/, and / ‹ӧ› has been rendered as /ɤ/ though probably /ə/ or /ɘ/ would be likewise more consistent (as in the Oxford handbook of Uralic from this spring). Even a / ‹а› might be for the Permic languages better rendered as IPA /a/ (unlike most of Uralic, where a contrasts with /æ/ and is thus indeed better rendered as IPA /ɑ/).
  • Hungarian uses tie bars for some of its affricates, /t͡s t͡ʃ/ etc. Not incorrect in any way, but this is used nowhere else in the data and not even entirely consistent within Hungarian. I also notice a straggling flap /ɾ/ appearing in erdő ‘woods’, féreg ‘worm’ that seems like an error.
  • Uvulars in Khanty aren’t dealt with very consistently at all. [q ʁ] as back-vocalic allophones of /k ɣ/ go unremarked, but /χ/ is indeed transcribed as uvular (ditto in Mansi). Worse, some data with /χ/ has been incorrectly entered for Vakh-Vasyugan Khanty, e.g. jŏχət- ‘come’, koχ ‘long’ (the actual VVj forms are jŏɣət-, koɣ). Only western Khanty ever has χ!
    I suspected data mix-up initially, but this clearly must be a processing problem instead, given even e.g. köχ ‘stone’: no such form appears anywhere in Khanty (it’s VVj köɣ, Jugan kä̆w, other Surgut kä̆ɣʷ, all western kew). Are these words derived from some orthographic source that spells VVj /ɣ/ as Cyrillic ‹х›, by any chance? (But still correct forms in many other cases like ‘head’, soɣ ‘worm’, wajəɣ ‘bird’.)
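
Slip-ups of the ˣ/ɣ/ pro /ɤ/ type are the kind of thing that could probably be caught mechanically. A rough Python sketch of one possible check follows; the function name, the vowel inventory and the "no vowel on either side" heuristic are all my own assumptions for illustration, not anything from the UraLex tooling.

```python
import unicodedata

# Assumption: a rough vowel inventory covering the IPA used in the examples
# above; a real check would want a per-language inventory.
VOWELS = set("aeiouyɑæøœɤɨɯəɘ")
# Length and palatalization marks are ignored when looking at neighbors.
SKIP = set("ːˑʲ")


def base_segments(ipa):
    """Return base letters only: decompose, then drop diacritics and length marks."""
    decomposed = unicodedata.normalize("NFD", ipa)
    return [ch for ch in decomposed
            if ch not in SKIP and not unicodedata.combining(ch)]


def suspicious_gamma(ipa):
    """True if some /ɣ/ has no vowel on either side, i.e. may stand for /ɤ/."""
    segs = base_segments(ipa)
    for i, ch in enumerate(segs):
        if ch != "ɣ":
            continue
        prev_is_vowel = i > 0 and segs[i - 1] in VOWELS
        next_is_vowel = i + 1 < len(segs) and segs[i + 1] in VOWELS
        if not prev_is_vowel and not next_is_vowel:
            return True
    return False


# Forms quoted in the list above: four Estonian slip-ups, one correct
# Estonian form, and one Khanty form with a genuine consonantal /ɣ/.
forms = ["hɣːrutɑ", "kɣvɑ", "lɣkːs", "mɣmisetɑ", "kɤvɑ", "koɣ"]
flagged = [f for f in forms if suspicious_gamma(f)]
```

This only flags candidates for a human to look at: a language that really allows /ɣ/ inside consonant clusters would produce false positives, and a misplaced /ɣ/ next to a vowel would go unnoticed.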

Looking over these issues, I could formulate a Rule #1 for IPA-fying FUT: the transcription systems do not correspond 1:1 and several details must be, alas, checked on a language-by-language basis. Especially vital is understanding your source data: whether whatever you are IPA-fying is pre-WW2 “hyperphonetic” FUT; mid-century “major-allophonic” FUT; or post-70s “phonological” FUT. IPA comes with its bracket notation [d͇], /d/, //ð// etc. to warn what level of transcription you might be dealing with… FUT does not, perhaps its biggest flaw. A related Rule #2 might be that it’s similarly important to understand what you are trying to do with IPA: phonological, broad phonetic or narrow phonetic transcription? Most of the time, there is no One Correct IPA Representation either.

In the base FUT data I do not see any further major issues. It would be probably good to make sure to distinguish ´ (the suprasegmental palatalization sign) and ˈ (the overlength / strong-grade cluster sign) in the Samic data though. Currently both seem to be much of the time encoded as a simple apostrophe; e.g. Inari Sami kyevˈđi ‘snake’, Skolt ku´vdd ‘id.’ are given as “kyev’di”, “ku’vdd”. Occasionally even opening or closing single quotes appear (thanks, Microsoft). Apostrophes do actually even triple duty in marking palatalized ľ in other languages, but this seems unlikely to do any real harm.
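
As a first step towards sorting these out, one could at least list every apostrophe-like character whose intended reading is unclear. A minimal Python sketch, under the assumption that the two signs are meant to be encoded as U+00B4 and U+02C8 as typed above (if the data uses other code points for them, the table would need adjusting); the function name is hypothetical.

```python
# Assumption: look-alike characters that do not say which of the two signs
# (palatalization vs. overlength / strong grade) was intended.
AMBIGUOUS = {
    "\u0027": "plain apostrophe",
    "\u2018": "opening single quote",
    "\u2019": "closing single quote",
}


def ambiguous_marks(form):
    """Return (index, description) for each apostrophe-like mark in a form."""
    return [(i, AMBIGUOUS[ch]) for i, ch in enumerate(form) if ch in AMBIGUOUS]


# A form entered with a plain apostrophe is reported ...
report = ambiguous_marks("kyev\u0027di")
# ... while forms using the dedicated signs pass through silently.
clean = ambiguous_marks("ku\u00b4vdd") + ambiguous_marks("kyev\u02c8\u0111i")
```

Deciding which sign each flagged apostrophe should become still takes knowledge of the language in question, so this is a finding aid, not a fix.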

Protoforms

The dataset is of course primarily about attested lexical data, so I maybe should not spend too much time on examining the proto-language reconstructions included (only Proto-Uralic, no intermediate reconstructions). Still, this is protouralic dot wordpress I am blogging at, so some observations on that topic too.

The transcription scheme seems to closely follow Janhunen 1981, Sammallahti 1988. The *i/i̮ reconstruction for noninitial syllables is used almost thruout; an *-e- has slipped in only in *koje-mV ‘husband’. *i̮ rather than *e̮ is used in initial syllables too, however still an **a in at least a few lexemes like *maksa ‘liver’, *maɣi̮ ‘earth’ (= J *mi̮kså; S *mɨkså, *mɨxi); also *ńś rather than *ńć, though a traditional *ć is still retained in some cases. Different transcription schemes are more inconsistently mixed for the “voiced spirants”, including ‹δ› in *śaδa- ‘rain’, but ‹ð› in *wuði̮- ‘new’; ‹x› in *juxi̮- ‘to drink’, but ‹ɣ› in *miɣi- ‘to give’.
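
The mixed spirant notation is easy to detect automatically. A small sketch, in which the grouping of symbols into competing pairs is my own assumption based on the examples just cited (in another reconstruction scheme *x and *ɣ could well be distinct segments, in which case the second pair should be dropped):

```python
from collections import defaultdict

# Symbol pairs treated here as competing notations for one proto-segment.
# This grouping is an assumption drawn from the examples in the text.
VARIANT_SETS = {
    "dental spirant": ("δ", "ð"),
    "velar spirant": ("x", "ɣ"),
}

def mixed_notation(protoforms):
    """Return, per segment, the forms using each symbol if more than one symbol occurs."""
    usage = defaultdict(lambda: defaultdict(list))
    for form in protoforms:
        for label, symbols in VARIANT_SETS.items():
            for symbol in symbols:
                if symbol in form:
                    usage[label][symbol].append(form)
    # Keep only segments for which at least two different symbols are attested.
    return {
        label: dict(by_symbol)
        for label, by_symbol in usage.items()
        if len(by_symbol) > 1
    }

protoforms = ["*śaδa-", "*wuði̮-", "*juxi̮-", "*miɣi-"]
for label, by_symbol in mixed_notation(protoforms).items():
    print(label, by_symbol)
```

Which symbol to settle on is then an editorial choice; the check only shows where two schemes have been mixed.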

A possible consequence of the dataset’s original compilation for a lexicostatistic review of the traditional Uralic classification is also that some meanings are marked as “[Not reconstructible]”, although they would have well-established though western-leaning proto-forms, e.g. *külmä ‘cold’ (maybe debatable; an IMO poor loan etymology from Balt(o-Slav)ic remains marked for the reflexes), *mälə ‘mind’ (clearly PU; this is reflected in derived verbs in Ob-Ugric), *läwlə ‘heavy’ (EKh ‘cold’ probably doesn’t belong). Some items reconstructed in recent literature are missing too, e.g. Aikio’s revamped *këččə ‘bitter’, *widä- ‘to kill’. More worrying for me is how also many long-known proto-forms are left absent, such as *küsə ‘thick’, *näkə- ‘to see’ (admittedly most reflexes derivatives w/o this meaning), *lükkä- and *puskə- both ‘to push’, *śepä ‘neck’, *sańća- ‘to stand’, *wëlkə ‘white’. I don’t think this can be just due to later semantic divergence in some reflexes, when e.g. *jelä ‘day’ has been admitted as a PU form only from Samoyedic direct evidence (parallels also at minimum in Samic); and *śilä ‘fat’ from no direct evidence at all? Yet also some poor comparisons from UEW seem to remain around, e.g. “*čočV-” ‘to wipe’; actually its only reflex meaning ‘wipe’ is Finnish huosi-, which I don’t think can belong here. [3] — These types of issues may even combine for more involved cases. E.g. the PU word for ‘full’ is given as *türə, a narrowly distributed Finnic–Permic etymon, and not the better-distributed *täwdə. This is again probably per UEW, which maintains Selkup tīr as reflecting the former and not, as recognized since Aikio 2002, the latter. [4] Or, the word for ‘year’ is given as *ärV; but this reconstruction was in effect already refuted by Aikio 2012, who points out that the Samoyedic forms (meaning ‘fall’) go back to PSmy back-vocalic *ër-, which continues rather the already better-distributed PU form *ëdə. [5]

A methodological choice also seems to have been that no synonyms are admitted for PU, although there probably are a few concepts in the data for which they existed; e.g. besides *śilä for ‘fat’, we can reconstruct also *wajə, *koja (both already alluded to in the database; the former though specializing to ‘butter’ in most Uralic languages familiar with agriculture).

(All my Uralonet links above show what I think of as their most reliable reconstructions, but defending those would be at times quite a debate that I don’t intend to get into in detail here — I’ll be happy as long as the reconstruction system chosen is at least internally consistent enough.)


Since following newer literature adequately appears to have given some difficulty for the team, I would like to note here (I think for the first time on this blog) that I’ve already a few years ago started a little repository of new results in Uralic etymology, currently keeping track of

  • newly proposed PU reconstructions;
  • newly found reflexes of known reconstructions;
  • newly found loan etymologies for what have previously been thought of as native Uralic etyma.

The list(s) can be found at the Sanat wiki, as a part of / appendix to our etymological database of Proto-Finnic. [6]

Currently pending updates include, besides better coverage of several earlier but post-UEW sources, especially several new native and loan etymologies for Mari and Permic from Metsäranta’s PhD thesis from 2020. I have also been thinking of starting an “antietymological” sister repository, tracking PU reconstructions that have been clearly disproven by better etymologies being published for all or all-but-one of their reflexes, of which there are quite a few by now too.

Etymological marking

Maybe the core content of the dataset. Standard literature has been followed quite faithfully here and I see no major flaws (even where etymological relationships have not been seen fit to be promoted to Proto-Uralic status). Mostly I can just point out some recent and overlooked results. Besides cases already mentioned:

  • The Hungarian word for ‘claw, nail’ has been unfortunately given as the less basic karom ‘claw, talon’ rather than köröm ‘claw, nail’; which was, even, recently argued by Aikio to be indeed a reflex of PU *künčə.
  • The Samoyedic words for ‘to scratch’ derive from PSmy *kətå ‘nail’, much as also e.g. Khanty *kö̆nč- does double duty as ‘nail; scratch’. The base noun is, probably correctly, not admitted as a cognate of rest-Uralic *künčə. The verb entry however inconsistently does encode them as cognates.
  • The most notable loan etymology missing entirely is probably the derivation of Erzya veśe and Hungarian össze- ‘all’ from earlier *wiśwV- ← Proto-Indo-Iranian *wićwa- > Sanskrit víśva, etc. (an etymology due to Katz 2003 that was unfortunately overlooked by Holopainen 2019). Both are regular: for Hungarian *wi- > *wü- > *ü- also in IIr. loans (besides native ones like *widä > öl- ‘to kill’), cf. the already long-known özvegy ‘widow’ < *wiðVwädźV ← Scythian / pre-Alanian *widawa-čī.
  • There is probably room to adjust many of the individual loanword etymologies, e.g. Kildin Sami sūll´ ‘salt’ is not borrowed from Russian соль but, as maybe the palatalization best reveals, from Finnic *soola (thus also UEW, SSA). This would regularly continue a Proto-Samic *sōlē > Peninsular Eastern Sami *suəllʲe, also present in Skolt suõ´ll. It would be way too much work for me to start digging into these on my own with any consistency, though.
  • There are, on the other hand, still several Proto-Indo-European loanword etymologies advanced that do not seem very reliable (were they ever widely accepted?), e.g. *pelə- ‘to fear’ ~ PIE *pelh₁- ‘to shake’ (which only gives ‘fearful’ in derivatives in Gothic and Slavic); *śalə ‘gut’ ~ PIE ? *ḱolH- ‘turn’ (which only gives ‘gut’ in Greek). These are though only marked as “probable”, not “clear” — is this basically a euphemism for “not that likely”?

I suppose this is by now enough comments for one day. I know that assembling and curating datasets this big is quite the task, and I could probably also spend a week more reading this in further detail. Hopefully I’ve already pointed out some productive directions for future improvement though. (And if you were thinking of otherwise releasing 3.0 just tomorrow: sure, don’t mind me, there will be time in the future too to improve things.)

Edit 2022-06-27: See also some brief responses from Outi Vesakoski (and further from me) at Twitter!

[1] Very relatively so: at triple rather than double or single digits of speakers.
[2] So far the biggest gap in philological coverage are probably the old Swedish “Biblical Sami” records, substantial already in the 18th century, but to my knowledge they have never been looked over in detail etymologically.
[3] Has been further etymologized as being maybe from Proto-Finnic *hosja ~ *hoosja ‘horsetail, Equisetum’ (traditionally used to make scrubs), which I don’t think has itself any etymology yet. By its phonological structure it obviously cannot be native Uralic as is. Inverting the semantic derivation though, an irregular (?) contraction from an agent noun *hosija < *hose/i-ja ‘sweeper, scrubber’ might be possible (cf. also Fi. hos-u- ‘to work carelessly, in a rush’). Or if this is, as UEW’s etymology would imply, really assibilated *hocija… a root that looks somewhat comparable to me is Samic–Mordvinic *šodə- ‘to let out, run out’ (maybe first derived to *šodə-j- > *hoci- ‘to throw/sweep things out’). A PU *čočV-, on the other hand, should not give Finnic *h- but *s-, via the affricate dissimilation seen also in e.g. *čečä ‘uncle’ > *ćečä > PF *setä.
[4] Worth noting, besides Aikio’s argument that cognates elsewhere in Samoyedic require a protoform with *ä-ə, is also that *türə would be expected to give Sk. **tir with a short vowel. tīr shows Helimski’s Law = Proto-Selkup vowel lengthening in Proto-Samoyedic *ə-stems, < PU *CVCCə stems and some *CV(C)CA stems (a relatively recent discovery from 2007).
[5] This does still leave Permic *ar ~ (Core) Mansi *ārmə (closed syllable per Pelym årəm with a short vowel), but the latter should clearly be analyzed as a loan from the former; more specifically, from derived *arm as reflected in Udmurt. Permic *a has no well-established native source at all and even some more dubious cases only really point to some possible origin from *ä.
[6] “Us” being myself, Santeri Junttila, Sampsa Holopainen & Juha Kuokkala, plus original data assembly by Kallio.

Posted in Commentary, Links

A Finnic Family Tree

I was recently asked on Twitter about the history and subclassification of Finnic. [1] Whipping up a full-length discussion paper or even a polished nice-looking family tree would be more work than I can produce on short notice or on free time (and probably something that might warrant wider publication still), but since I actually do have several opinions about this, that are probably either scattered in several places or that I haven’t mentioned anywhere yet, here is a summary of my current thinking.

I’ve given datings of proto-languages and extinction dates only where I can pretend to have any sense of accuracy to them. Error ranges are at least ±100 years for the former, at least ±10 for most of the latter.

┐ Proto-Finnic (ca. 500 BCE, ? middle Daugava)
├─┐ Proto-South Estonian (ca. 500 CE, ? upper Gauja)
│ ├── † Leivu (northern Latvia; extinct 1988)
│ └─┐ Mainline South Estonian
│   ╞══ Mulgi South Estonian
│   ╞══ Tarto South Estonian
│   │   [basis of Old Literary South Estonian]
│   └── East South Estonian (Võro–Seto)
├───┐ Proto-Livonian (ca. 1000 CE, lower Daugava)
│   ├── † Salaca Livonian (northwest Latvia; extinct ca. 1870)
│   ├── † Riga Livonian (unattested, extinct in the 13th C?)
│   └─┐ Courland Livonian
│     ├── † Eastern Livonian
│     ├── † Central Livonian
│     └── (†) West Livonian
└─┐ Proto-Core Finnic (location?)
  ├─┐ Proto-Central Finnic (location?)
  │ ├─┐ Estonian proper
  │ │ ╞══ Insular Estonian
  │ │ ╞══ East Estonian dialects
  │ │ ╞══ West Estonian dialects
  │ │ ╞══ Central Estonian dialects
  │ │ ╘══ North Estonian proper
  │ │     [basis of Modern Standard Estonian]
  │ └─┐ Proto-Votic (inland Ingria)
  │   ├── † Eastern Votic (extinct 1976)
  │   ╞══ † Central Votic (extinct > 1950)
  │   ╞══ Lower Luga Votic
  │   └── † Krevinian (Southern Latvia; extinct ca. 1850)
  └─┐ Proto-North Finnic (ca. 0 BCE, ? coastal Estonia)
    ├─┐ Proto-Northwest Finnic (? coastal Estonia?)
    │ ├── Northeast Coastal Estonian
    │ ╞══ Taivassalo / Very Southwestern Finnish
    │ ├─┐ Southwesternish Finnish
    │ │ │ [main basis of Old Literary Finnish]
    │ │ ╞══ North SW dialects
    │ │ ╞══ South SW dialects
    │ │ ╞══ Western Uusimaa dialects
    │ │ ╘══ probably other dialects in the SW transitional zone
    │ └─┐ Mainline Finnish (ca. 200 CE, Kumo River)
    │   │ [main basis of Modern Standard Finnish]
    │   ╞══ Lower Satakunta dialects
    │   ╞═╤ West Upper Satakunta dialects
    │   │ └── Austrobothnian Finnish
    │   ╞══ Ostrobothnian dialect chain
    │   ├── Kemi Finnish
    │   ├─┐ Torne Valley Finnish
    │   │ ╞══ Lower Torne Valley dialects
    │   │ ╘══ Upper Torne Valley dialects
    │   │     [incl. Meänkieli & Kven]
    │   ├─┐ Kalix Valley Finnish
    │   │ ├── † Lower Kalix Valley Finnish (unattested)
    │   │ └── Jällivaara Finnish
    │   └─┐ Core Tavastian (ca. 300 CE)
    │     ╞══ East Upper Satakunta dialects
    │     ╞══ Heartland Tavastian dialects
    │     ╞═╤ South Tavastian dialects
    │     │ └── colloquial Helsinki Finnish
    │     └─┐ East Tavastian
    │       ╞══ Southeast Tavastian dialects
    │       └─┐ Northeast Tavastian
    │         ╞══ Päijät-Häme dialects
    │         └─┐ Karelid Finnic
    │           ╞══ Savo dialects
    │           ╞══ Karelian Isthmus / Southeast Finnish dialects
    │           ╞══ Ingrian
    │           └─┐ Old Karelian (ca. 700 CE, NW Ladoga)
    │             ╞══ Olonets Karelian
    │             ├── † Sortavala Karelian (unattested)
    │             │   [substratal to Sortavala Finnish]
    │             └─┐ Karelian proper
    │               ╞══ Viena / Northern dialects
    │               ╘═╦ Southern dialects
    │                 ╚══ Central Russian dialects
    │                     (Tver, Tikhvin, Valdai)
    └─┐ Ludian–Veps (ca. 600 CE, SE of Ladoga)
      ├── † Olonets Ludian (unattested)
      │     [substratal to Olonets Karelian]
      ╞══ North Ludian dialects
      ╞══ Central Ludian dialects
      ╞══ South Ludian dialects
      ╞═╗ North-Central Veps
      │ ╠══ Northern Veps dialects
      │ ╚══ Central Veps dialects
      ╞══ Southern Veps dialects
      └── † North Chudian (unattested)
          [substratal to some Northern Russian,
           in contact with Proto-Komi]

The South Estonian sub-tree here is the part that has been published the most recently, basically from Kallio (2021, 2018); though I’d like to see more detail on the suggested Tarto–VS group still.

Some other divergences of note from earlier Finnic family trees include:

  • No Coastal Finnic (Livonian + Core), contra Kallio. I will be arguing for this in detail in a future paper. Among the early branches, Core Finnic and Central Finnic seem to hold up better so far, though I’m open to the possibility that some North Estonian dialects may eventually prove to have some fairly deep archaisms to them too. North Finnic I have several suspicions about, but Ludian–Veps still has nowhere better to go in the tree than with my “Northwest Finnic”.
  • No East Central Finnic (East Estonian + Votic), contra Viitso. These are united only by some cases of õ, which I however consider to be archaisms already from common Central Finnic. [2] This also allows for (re)introducing a non-paraphyletic Estonian sensu stricto.
  • Paraphyletic Western Votic, directly following Kuznetsova, Muslimov & Markus (2015).
  • Paraphyletic Western Finnish and Tavastian Finnish, generalizing further from Kallio (2013). Purely by linguistic evidence, the traditional “Western Finnish” grouping would be about as well-supported as my “Mainline Finnish”, but settlement history to me seems to strongly favor the latter: the Karelid group can’t just drop out of nowhere, it needs to be derived from somewhere at the time in the early 1st millennium when there simply wasn’t any Finnic presence yet in eastern Finland (but parts of western Finland had already been Finnic-speaking for some centuries, with presumable incipient diversification). Archeology so far does not favor an independent expansion from the south; the river Kymi would look like a good route candidate for that at first, but it might have been simply too non-navigable with its several major rapids. Hence, Karelid Finnic must be nested not just within “Finnish”, as has been known already for long, but indeed within “Western Finnish”.
  • Polyphyletic Ostrobothnian Finnish. Some of these lineages may eventually prove to be offshoots of specific Western dialect groups further south, but current research really hasn’t even started that line of investigation (though see next item).
  • Austrobothnian (= my term for South Ostrobothnian) as a West Upper Satakunta offshoot specifically. This is a well-known fact of settlement history, but has some implications for analyzing what is areal and what is old inheritance across the Western Finnish dialect continuum that I don’t think have been fully appreciated in the past.
  • No Karelian–Veps group. This seems like a no-brainer to me: there are practically zero common innovations (some lexical evidence has been claimed but without ruling out common archaisms or loanwords) vs. quite abundant Finnish–Karelian = Northwest Finnic innovations, even beyond the Karelid group. Some more narrowly distributed, e.g. Ludian–Karelian, innovations exist, but their absence from Veps or eastern Finnish I think immediately shows them to be areal rather than genealogical.
  • Paraphyletic Ludian and perhaps Veps. The latter above all due to the fact that most innovations in Veps could be attributed to Russian influence or at least are downstream of changes due to this. Not tying down the assumption that Veps must be monophyletic seems like the safer bet so far.

I take no stance here on the still gradually ongoing debate on if the Kukkuzi dialect is Votic-with-Ingrian-superstratum or Ingrian-with-Votic-substratum or a mixed variety entirely. [3]

Last, don’t take the rather fine detail of Finnish dialects as meaning that they’re actually more different from each other than what we find within other groups — they’re just a) more numerous (even 100 years ago Finnish had 2× the speakers of Estonian, 60× the speakers of Karelian, 200× the speakers of Veps…) and b) better known to me. If I had been looking into e.g. Estonian dialectology in as much detail, I would probably have some opinions also on how to re-tool things around there.

[1] Yes, I am on Twitter as of the start of this year. Not explicitly announced on the blog before, though you may have noticed if you’ve checked my About page recently.
[2] One intriguing example is PF *kota : *koda- ‘house’, giving in my view early PCF *këta : *këða- > later PCF *këta : *kë.a-, whence Vt. kõta : kõa-; EEst. kõda : kõja-; NEst. koda : koja-. What is telling here is that Estonian -j- as a hiatus filler only seems to be regular after illabial vowels, thus showing that NEst. koda does not retain PF *o; it has instead undergone the development *kë-a > ko-a that also appears in cases like *këldajnën > *këllainë(n) > kollane ‘yellow’ (which has “primary” *ë < *e, not “secondary” *ë < *o; cf. Finnish keltainen).
[3] For general historical Fennistics purposes it’s in any case sufficient to know that any attestations in Kukkuzi but not in “normal” Votic can be always from Ingrian, be it by loaning or descent, i.e. not requiring reconstructing anything all the way to Core Finnic.

Posted in Commentary, Reconstruction

*-ətA adjectives in Mordvinic

Across Finnic and Samic, one of the more characteristic adjective endings is *-əta ~ *-ətä; yielding e.g. Finnish -ea ~ -eä, Estonian -e, Northern Sami -at. The Permic cognate *-i̮t is also at least relatively common. Because Of Reasons I have gone for a hunt for reflexes in Mordvinic, where no productive reflex survives. More specifically I’ve gone over Paasonen’s Mordwinisches Wörterbuch (a few more could be probably found in other sources). The scoop is as follows.

First, some cases well-known in the comparative literature. (Noticeably often these have exact equivalents in Finnic, or indeed specifically Finnish.)

  • *kalgədə ‘hard’ (> Er. kalgodo, Mk. kalgəda) < WU *ka/ëlkəta > Fi. kalkea [1]
    (MWB unwarrantedly lists this as a derivative of *kalgə ‘sheaf, etc.’, which is rather < WU *këlkə ‘haulm’)
  • *śejəďə ‘thick’ (> Er. śejeďe, Mk. śiďä) < WU *śikətä > Fi. sikeä ‘sound (of sleep)’
    (the Moksha form miscited in Uralonet as śäjiďä — a real form, but rather from some Erzya dialect that has *e > ä)
  • *taŋgədə ‘firm, stiff’ (> Mk. taŋgəda) < WU *taŋkəta > Fi. tankea ‘id.’
  • *valdə ‘light’ (> Er. valdo, Mk. valda) < WU *wëləta > Fi. vaalea ‘id.’
    (in UEW / Uralonet, Mordvinic incorrectly under the longer variant *wëlkəta)
  • *vijəďə ‘straight’ (> Er. vijeďe, Mk. viďä) < WU *wojkəta > Fi. oikea, NS vuoigat etc. ‘right’

We see here reflexes as *-ədə / *-əďə after a consonant cluster, syncopated *-də after a PU sonorant (but apparently not after single *k). Moksha śiďä, viďä are probably due to secondary post-Proto-Mordvinic syncope (unclear to me if with fusion *jď > ď or, as might be suggested with *ej > i in the former, with vocalization of the glide). Not many other cases follow this exactly, though. I find only one other clear example + one possible example in *-ədə:

  • ? *ľifčədə ‘loose’ > Mk. ľifčəda; from a stem common with e.g. *ľifčańa ‘pliable’, Mk. ľifčəm- ‘to relax’. Attested as both an ə-stem ľifčədə- and an a-stem ľifčəda- though, hard to tell which might be primary.
  • *vačədə ‘hungry’ > Er. vačodo, Mk. vačəda; from *vačə ‘hunger; hungry’

For *-də after CVR-, I find two more examples, and also two nouns that might derive from former adjectives:

  • Er. boďo ‘obese’. Perhaps distorted from *vojdo, and thus a derivative from *vaj ‘butter, fat’ (which in Erzya develops as > *voj > oj)? Still would have expected *-ďə, but there’s no possible soundlawful origin for an Erzya word ending in -ďo anyway…
  • *naŕďə ‘firm, tough’ > Er. naŕďe, Mk. naŕďä (no base root that I can identify)
    (update: or maybe from PU *ńërə ‘cartilage’??)
  • *śardə ‘elk, reindeer, deer’ > Er. śardo, Mk. śarda. Has clear cognates at least in Mari (*šårδə) and Khanty (*sūrtāj; Northern Mansi surti probably a loan from this), with the PU form usually reconstructed as *śarta. However I suspect this was originally rather an adjective *śarwəta ‘horned’ ← *śarwə ‘horn’. Loss of *-w- in clusters may have been early enough in Mordvinic and Mari to allow common syncope from *śarəda to *śarda. [2]
  • Mk. šoľďä ‘crazy person, crybaby’. Could this be from a common root with Finnic *hullu ‘crazy’ (both pointing to earlier *šul-)? The morphology of the Finnic word remains obscure though, and the palatalization in Moksha would be unexpected; maybe suggests something like *šuljəta. Alternately, maybe ‘crybaby’ is more original, and the Mk. word is instead from a common root with Erzya čoľeďe- ‘to chirp, trill’? Either way this would probably have been an original adjective.

There are however several adjectives ending in *-adə, derived mostly from stems already ending in *-a-. This contrasts with the suffix’s behavior in Finnic and Samic, where it always carries a 2nd-syllable *ə even when attaching to *a-stems (e.g. Fi. lauha ~ lauhea ‘mild (of weather)’, notka ~ notkea ‘pliable’). I suppose the widespread Proto-Mordvinic reduction of 2nd-syllable vocalism led to a reanalysis of *-ədə as just *-də, and then later on the rise of new cases attaching to different stems.

  • *kaladə ‘broken’ > Er. kalado, Mk. kalada; from a stem *kala- common with e.g. *kaladə- ‘to break (intr.)’, *kalaftə- ‘to break (tr.)’
  • *komadə ‘turned over’ > Er. komado, Mk. komada; from *koma- ‘to turn over’ (< PU *kuma-)
  • *naksadə ‘rotten’ > Er. naksado, Mk. naksada; from a stem *naksa- common with e.g. *naksaftə- ‘to let rot’, *naksalgadə- ‘to begin to rot’
  • *ozadə ‘sitting’ > Er. ozado, Mk. ozada; from *oza- ‘to sit’
  • *panžadə ‘opened’ > Er. panžado, Mk. panžada; from *panžə- (!) ‘to open’ (< PU *panča-)
  • *śťadə ‘straight, standing’ > Er. śťado, Mk. śťada; from *śťa- ‘to stand’
  • *štadə ‘naked’ > Er. štado, Mk. štada; from *šta- ‘to be exposed, cold’
  • *tajadə ‘stupid, grumpy’ > Er. tajado; from a stem *taja- common with e.g. *tajardə- ‘to be timid, dejected’, *tajaskadə- ‘to become grumpy’

Itkonen (1963, CIFU 1) has proposed to consider a chunk of these to be instead primarily adverbs, formed with the homophonic ablative suffix *-də, but I’m not sure if this is a good analysis: Mordvinic infinitives and participles are generally marked, not formed by appending case endings to a bare verb stem. Also, I would analyze *kala, *naksa, *taja to be primarily noun roots ‘brokenness’, ‘rottenness’, ‘unsatisfiedness’.

Still more interestingly, I can also find adjectives where the final vowel looks to have escaped vowel reduction.

  • Mk. aluda ‘underlying; under’. Another adverb/adjective, seemingly pleonastic from an unattested *aləŋ > *alu ‘underlying, undery’ (maybe ousted by the homophonic lative adverb: Er. alov, aloŋ, Mk. alu ‘(to) under’).
  • Er. čando, čonda ‘pricey; price’. Probably not a cognate of Fi. hinta ‘price’ as traditionally compared. MWB hesitantly but I think more likely correctly suggests a connection with Er. Mk. čana ‘price’, which is ← Ru. цена.
  • *pärda > Er. ala-berda, Mk. ala-pärda ‘misshapen’ (“under-pärda”). Probably still an independent word in PMo., given how Erzya and Moksha differ in if they adopt compound-medial stop voicing (“rendaku”, we might call it).
  • *säŕďa ‘fragile of old age’ > Er. seŕďa, Mk. śäŕďä; evidently from a common root with *säŕəďə- ‘to hurt, be sick’ (? < PU *särä-, though intriguing resemblance also with Finnic *särke- ‘to hurt’).
  • *šopəda ‘dark; darkness’ > Er. čopoda, čobda, Mk. šobda, šovda; from *šop ‘in a day, for a day’
  • *topəda ‘dark (of color), maroon’ > Er. topoda, Mk. tobda; from *topə ‘full’, the meaning apparently thru expressions like *topəda_seń ‘full blue’ = ‘dark blue’.
  • #ťožda ‘light’ > Er. čožda ~ Mk. ťožďä (no base root that I can identify). Reconstruction difficult due to several irregularities. Is Er. č- maybe by contamination with čova ‘thin, fine’?

In three of these, we find a similar environment to where PU 2nd-syllable *a survives: after a 1st-syllable *o < PU *u. Maybe the same would have originally allowed even retention of a 3rd syllable *a? — By contrast the disharmonic *pärda, *säŕďa pretty much have to be Mordvinic-internal formations. Could an adjective suffix *-da have been generalized / extracted just from cases like *topəda?

No further answers today; just a look at what other etymological candidates we might have in Mordvinic for residues of this ending.

[1] Close to a ghost word, though; kalkee ‘poor, low-quality’ is only known from one Finnish dialect. This can only really link to ‘hard’ thru kalki ‘poor, unlucky’ (“having a hard time”) from one early dictionary. The reported “dialect variant” kalkkea ‘loud, talkative, lively’ seems likely to be unrelated and instead from the verb kalkkaa ‘to ring (bell), make loud noise’ (many similar derivatives from this, also e.g. kalkas ‘lively’, kalkatti ‘blabbermouth’). — Estonian kalk, kalge ‘hard, brittle’ is a more reliable cognate in any case at least.
[2] In Mo. loss of *w probably postdates medial voicing though: by a few examples, *-tw- *-sw- seem to yield PMo. *-t- *-s-, not **-d- **-z- (at least *latə ‘shelter, roof’ ~ Finnic *latva ‘canopy’, *kas- ‘to grow’ ~ Finnic *kasva-).

Posted in Etymology

Will Someone Please Reconstruct Proto-Kurdish Already

Some things about comparative linguistics you might just take for granted in your own little corner of a particular language family, until you start looking at how they do things in others. In Uralic studies, we’ve known for 200+ years, and put into explicit practice since 150+ years ago, that progress requires documenting unwritten language varieties (just comparing literary Hungarian / Finnish / Estonian / Sami runs out of steam fast [1]). For 120+ years, even, that it’s additionally good practice to get detailed interdialectal comparison of such languages started sooner rather than later, not just rely on one well-known doculect.

The big dog of our Eurasian linguistic region, Indo-European studies, has of course an enviable access to a good bunch of attested Old Indians, Old Church Slavonics and Old High Germans, which are a lot more directly comparable with each other. But you’d think the field would have somewhere during the 20th century understood at least that, yes, newer-attested languages will have contributions to make to the overall picture too. Remember e.g. how Nuristani, a little bunch of languages up in the mountains of Afghanistan, turned out to have the key evidence for affricate reflexes of *ḱ *ǵʰ *ǵ in Proto-Indo-Iranian, preserved several millennia longer than in Avestan or Sanskrit?

Where Slavistics, Baltistics, Germanistics, Armenistics, Romanistics have all still gotten their general comparative programmes rolling pretty well, Indo-Iranian keeps being a rock that drags behind pretty badly. Considering extra-scientific causes, this is not a giant surprise / is clearly in some amount thanks to these other sub-fields’ status as National Sciences in the various nation-states of Europe. Still, this would not have to be the case, it’s not like Celtistics has been left in the dust. Comparative linguistics also seems like something with sufficiently little direct political valence that it should be doable enough even e.g. under Iran’s current theocratic administration, let alone by the sizable and somewhat intellectual-leaning Iranian diaspora(s). Indian fans of the Out-of-India theory also demonstrate an existing if unorganized interest in linguistic history.

But indeed. Indo-Iranian is not just any random branch of Indo-European; it is today the largest branch (e.g. Glottolog counts 319 varieties, out of 581 Indo-European varieties altogether), and also the only one to preserve all of its known main branches since antiquity. Reflects more branches today than in history, really: already Nuristani is nowhere to be seen until the 19th century. By contrast, in Europe East Germanic, West Baltic, Continental Celtic, Aeolian Greek etc. are long gone. If anywhere in IE, it is in Indo-Iranian that we should expect to be able to reach quite deep time depths by collecting data from modern varieties and applying comparative reconstruction efforts as usual. Yet this generally seems to have not been done, and approximations derived mainly from Sanskrit and Avestan end up making do as Proto-Indic, Proto-Iranian, and the main fodder for Proto-Indo-Iranian.

By now there is clear evidence that this is insufficient. One informative case from recent years is Martin Kümmel’s observation that “secondary” word-initial h- in several Iranian varieties — at least Khotanese and many western Iranian varieties including Middle to New Persian — actually seems to be a retention of PIE laryngeals (especially *h₂)! This may not have been completely out of the blue. Laryngeal hiatus in Vedic (*aHa > *a.a > ā in some cases still parsing as two syllables) has been known since the early decades of laryngeal theory, and Cheung’s Etymological Dictionary of the Iranian Verb from 2007 takes an extremely cautionary approach of projecting all PIE laryngeals into Proto-Iranian, including an implausible-looking contrast between this *H and secondary Iranian *h < *s (and implausible-looking clusters like *Hhauš- ‘to dry out’). [2] Regardless we do see that it is incorrect methodology to treat any divergences from attested Old Iranian as innovations, and that this will fail to connect archaisms in marginal new Indo-Iranian varieties back with the wider programme of Indo-European reconstruction. The same has been very patiently explained by Kümmel too, in a 2016 paper “Is ancient old and modern new? Fallacies of attestation and reconstruction”.


I’ve picked Kurdish here as a semi-random example of a modern Iranian language group that probably deserves closer investigation in this fashion, though its western peripheral location might indeed make it a more likely location for archaisms than smaller languages more fully encircled by Persian. It quite clearly shares at least the propensity of retaining *h₂. Even just looking over the lexicon of standard Kurmanji as listed at Wikipedia readily turns up cases like hêk ‘egg’ << PII *Hāwyam < PIE *h₂ōwyom; hirç ‘bear’ << PII *Hr̥ćšas < PIE *h₂r̥tḱos (~ Middle Persian xāyag, xirs). However also cases like hesp ‘horse’, where some kind of “aspiration throwback” could be considered (*asp- > *esʰp > hesp).

The outlines of Kurdish historical phonology are known, of course. Relatively detailed discussion is readily found in sources like Asarian & Livshits (1994), or at Iranica Online. What seems to be missing from these accounts, however, is any real integration of variation among the Kurdish “dialects” (by now widely thought to comprise at least 2–3 languages). They also spend much effort on lamenting difficulties in telling what might be native Kurdish words and what loanwords from Persian or Zazaki or some other neighboring Iranian variety; same as in many other studies on individual western Iranian varieties. But we — at least e.g. us Uralicists — know quite well that attention paid to dialectology is often able to resolve such issues! Maybe some Kurdish variety would turn out to display a form different from the others that would then need to be considered the native one; or to display a different loanword substitution, pointing in favor of relatively recent loaning, whether from Persian or not. Dialect differences could also help with relative chronology, in telling late areal changes (and across Iranian these are many) apart from what really are early Proto-Kurdish innovations. The retained laryngeals, too, are noted by Kümmel to not be entirely systematic. Conceivably it could be the case that e.g. Kurdish only gets them through Persian, at some older or newer date. Or inversely, maybe Kurdish might be in its native vocabulary more systematic about this than Persian is. No way to tell before looking.

Let me be clear here on the proposal. A reconstruction of e.g. Proto-Kurdish should not rely on just some handful of already available descriptions / dictionaries (though I’m sure their comparison, too, would already add up to several results), nor aim just for identifying phonological variation. The goal of such a project should be primarily lexicogeographic: to have detailed enough dialectological picture to be able to see the directions of vocabulary spread, to tell local innovations apart from local archaisms. In Uralic studies, when putting together an understanding of, or at least the data for understanding, Proto-Samic, Proto-Mari, Proto-Permic, Proto-Mansi, Proto-Khanty, Proto-Selkup, etc., we have routinely based this on low double digit numbers of varieties, each documented at least to low quadruple digits of vocabulary. And these are all smallish language groups, spoken by some tens or hundreds of thousands of people. The Kurdish languages have tens of millions of speakers altogether. Even if extensive fieldwork in Kurdistan were to look too dangerous or politically complex right now, already connecting with the diaspora communities worldwide should be easily able to provide data on some dozens of varieties.

I do not pretend that this would be a small or quick task (it is clearly beyond what I or anyone could accomplish as just one unattached researcher), but it seems like a very doable task, and likely fruitful, not just for the circles of Kurdish studies or Iranian studies, but for Indo-European studies altogether. And by no means is this a gigantic endeavor either. This could be all done in under a decade by one research group, if there was first of all the will for it to happen (be funded and prioritized).

Closing up this plea, let me also suggest one other hypothesis that could be up to something. In existing overviews, Kurdish is reported to “sometimes” show PIr. *x > kʰ, e.g. *xara- > /kʰer/ ‘mule’. The facts that this development (1) fails to be regular and (2) seems to be a regression (alleged PIr. free-standing *x is < PII *kʰ) should already suggest that it is perhaps an archaism rather than an innovation. The same might go for /tʰ/ from “PIr. *θ” < PII *tʰ, reported at least in *θaiwar- > /tʰiː/ ‘brother-in-law’. This interpretation is not airtight off the cuff by any means: both Armenian and Semitic influence could have encouraged secondary introduction of aspirated stops. But, interestingly, on a brief look-around I do not find cases where Persian /x/ ~ Kurdish /kʰ/ would derive from a secondary *x that continues *h₂, only cases with PII *kʰ. From the former, the result seems to be /h/, as above in e.g. ‘egg’. So did Kurdish regularly shift *x > /h/, while never shifting *kʰ? Again, detailed dialect evidence could perhaps swing this either way. One of these decades we will hopefully know better.

[1] Yes, written Sami already existed 200 years ago, indeed since the mid-17th century. The first variety to have been standardized to some practical extent was so-called “Old Swedish Sami”, a clergy-designed form from the mid-18th, based most closely on Ume Sami though aimed as a general western interdialectal standard. Standard Northern Sami took its first steps around the same time as well.
[2] Omniretained laryngeals are furthermore trouble also for e.g. RUKI. If we have *s > *š in e.g. *buHs- > *buHš ‘to endeavor’, as if triggered by a preceding *H and not *ū, why not also in e.g. *yaHs- > *yaHh- ‘to girdle’? Without the assumption of universal laryngeal preservation, though, this could be easily resolved by assuming *eH >> *ā as an independent vocalization from *iH/*uH > *ī/*ū. Note also a further but welcome corollary: if we do go with thinking that RUKI in *buHš- has been triggered by a long-retained *H, then also *Hhauš- will have to be simplified to just *hauš-, indeed already to a pre-RUKI late PIE *sews- < early PIE *h₂sews-.

Posted in Commentary, Methodology

Phonological Renormalization

A small definition of a concept.

Across the dialectology of various languages we very often find almost the same segment inventory despite various innovations. I call this phenomenon “phonological renormalization”. It seems somewhat mysterious at first: it is hard to see how a language’s status as a part of a large dialect continuum could outright prevent innovative phonological features from arising. However, it does seem to me that there could be an easy way out, by assuming a slight diachronic detour: suppose that new innovations do arise, but then simply change back into another segment already known in the language’s other dialects. A sufficiently homogeneous “sociophonetic environment” could probably motivate novel phonological segments to often merge with pre-existing close matches. Specifically, learners / innovating speakers faced with the prospects of (1) adopting an innovative form and (2) adopting an innovative segment might accept the former, but still avoid or fail to manage the latter.

This kind of hammer-down-the-nail development can be sometimes directly attested. Both stages are historically recorded e.g. for the fate of *ð [1] in some western dialects of Finnish: first flapping, to create a new phoneme /ɾ/, then merging with the usual trilled rhotic /r/. Or, in Eastern dialects of Finnish: early loss of medial *-n- in the allegro forms of the 1PS and 2PS pronouns minä, sinä evidently first created a rare transient diphthong /iä/, recorded only in a narrow southwestern corridor from Mikkeli to Hamina; but elsewhere renormalized to /ie/. Even in Mikkeli it has been fortified by the late Savonian–Karelian diphthongization of *ää to /iä/. I also wonder if the Western Finnish allegro forms , actually continue older miä, siä with different renormalization (but before the lowering /ie/ > /iä/).
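The flap-to-trill case can be caricatured as a nearest-match lookup. The sketch below is only a toy model: the feature table, the `distance` measure and the `renormalize` function are all my own illustrative assumptions, not an established analysis. It treats a novel segment as merging with the closest segment already present in the surrounding dialects, provided one is close enough.

```python
# Toy model of "renormalization": a novel segment merges with its closest
# match in the inventory of neighbouring dialects. The features and their
# values are illustrative assumptions, not a worked-out feature theory.
FEATURES = {
    "t": {"manner": "stop", "voiced": False, "rhotic": False, "lateral": False},
    "d": {"manner": "stop", "voiced": True, "rhotic": False, "lateral": False},
    "l": {"manner": "approximant", "voiced": True, "rhotic": False, "lateral": True},
    "r": {"manner": "trill", "voiced": True, "rhotic": True, "lateral": False},
    "ɾ": {"manner": "tap", "voiced": True, "rhotic": True, "lateral": False},
}


def distance(a, b):
    """Count the features on which two segments differ."""
    return sum(1 for name in FEATURES[a] if FEATURES[a][name] != FEATURES[b][name])


def renormalize(segment, areal_inventory, max_distance=1):
    """Return the segment that an innovation ends up as.

    A segment already in the areal inventory is kept. A novel one merges
    with its nearest areal neighbour if that is within max_distance;
    otherwise it survives as a new phoneme.
    """
    if segment in areal_inventory:
        return segment
    nearest = min(areal_inventory, key=lambda other: distance(segment, other))
    return nearest if distance(segment, nearest) <= max_distance else segment


areal = ["t", "d", "l", "r"]
print(renormalize("ɾ", areal))                  # nearest match is the trill
print(renormalize("ɾ", areal, max_distance=0))  # no merger allowed: tap survives
```

The `max_distance` threshold stands in, very crudely, for how homogeneous the “sociophonetic environment” is: the higher it is set, the more readily a novel segment gets hammered down.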

My aim is not to list extensive amounts of evidence here. Renormalization is pretty easy to notice once you start looking, maybe especially among consonants and/or small phonological inventories? But just off the top of my head, a few other conspicuous examples that come to mind include the following:

  • Glide epenthesis in Finnic and Samic. Both language groups generally turn *w into labiodental /v/ = usually an approximant [ʋ]. Whenever any kind of a labial glide develops later on, e.g. before word-initial /oː/, or in hiatus following any labial vowels, or from lenited *b, this likewise tends to produce /v/, not [w]. Occasionally a [w] can be attested, e.g. in Standard Finnish in cases like *kauɣan > kauan [kau.an ~ kauwan] ‘for long’; for which maybe most dialects however show instead kauvan.
  • Preaspiration *TT > *ʰTT in Samic. This affects first the native Uralic geminates already in Proto-Samic; slightly after that, across western Sami, also secondary geminates introduced by consonant gradation.
  • Lenition-plus-fronting *ɣ > /v ~ j/ in Mordvinic (/v/ in back vowel environments, /j/ in front). This affects first *ɣ from PU medial *k across the whole group, slightly after that also *ɣ from the lenition of *ŋ, across all of Moksha and most of Erzya.

At least the second has even been proposed to actually represent a single innovation, which would postdate *Cˑ > *Cː in western Sami but predate it in eastern. An equivalent scenario, with old *w and *ɣ preserved until the rise of secondary cases, could be sketched for the other two too. But perhaps the phenomenon of renormalization should be a sufficient explanation.

More generally, I suspect this phenomenon could end up accounting for much of what is normally called “cyclicity” of phonological rules. In a language that e.g. aspirates its initial stops (say, English) it’s not that a newly born /t/ from a source that does not already have aspiration (say, from /θ/) would have to be instantly aspirated, since we do sometimes find allophonic contrasts phonemicizing in this fashion. But if this did create a three-fold contrast /tʰ/ : /t/ : /d/, I predict that this should prove unstable and be quickly followed by an additional merger [t] > [tʰ]. After this the innovation could also no longer spread to other speakers as a “phonetic” change [θ] > [t], only in a purely phonological form /θ/ > /t/ [tʰ : t], potentially quickly leaving any small original areas of /tʰ/ : /t/ in obscurity.
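The predicted sequence can be spelled out step by step. The following is a minimal sketch with a three-word toy lexicon; the transcriptions are simplified, and the helper names (`shift_initial`, `initial_contrasts`) are hypothetical. It only shows how an unconditioned /θ/ > /t/ first creates a three-way contrast among initial coronal stops, which the follow-up merger [t] > [tʰ] then removes again.

```python
# Toy lexicon for a language with aspirated initial stops.
# Transcriptions are simplified and only meant as an illustration.
lexicon = {
    "tin": ["tʰ", "ɪ", "n"],
    "thin": ["θ", "ɪ", "n"],
    "din": ["d", "ɪ", "n"],
}


def shift_initial(lex, source, target):
    """Apply a word-initial sound change source > target to every word."""
    return {
        gloss: [target] + phones[1:] if phones[0] == source else list(phones)
        for gloss, phones in lex.items()
    }


def initial_contrasts(lex):
    """Collect the set of distinct word-initial segments."""
    return {phones[0] for phones in lex.values()}


stage1 = shift_initial(lexicon, "θ", "t")   # innovation: new plain /t/
stage2 = shift_initial(stage1, "t", "tʰ")   # renormalization: [t] > [tʰ]

print(sorted(initial_contrasts(stage1)))  # three-way contrast
print(sorted(initial_contrasts(stage2)))  # back to a two-way contrast
print(stage2["thin"] == stage2["tin"])    # 'thin' and 'tin' have merged
```

Seen from stage 2 alone, the net effect is indistinguishable from a direct /θ/ > [tʰ], which is the sense in which the intermediate stage would be left in obscurity.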

And I wonder further… other kinds of reiterated sound changes could find similar explanations too. Juliette Blevins, for example, has a recent paper observing that Austronesian *q has disproportionately few reflexes that are actually /q/. She proposes that this should be taken as evidence that the phonological stability of uvulars is conditional on a language’s vowel system, which might not be all wrong. (Cf. some further comments from me @ Tumblr.) But could it be additionally the case that an area having many languages with /ʔ/ creates a pressure for other languages to “normalize” (hardly “re-”) even an inherited, native *q into /ʔ/…? Or, indeed, even a *k, thus setting the stage for the famously common-in-Oceanic chainshift *t *k > /k ʔ/?

Of course though, this goes both ways and it’s also possible that many cases could be accounted for simply by internal phonological instability. [2] Already in the above examples: e.g. a contrast [w] : [ʋ] would be itself pretty rare, ditto the western Finnish system with all three of /ɾ r rː/. Still, at least the specific direction in which these systems seem to collapse does look to be largely determined areally, and e.g. no cases of the chainshift *r *rː > /ɾ r/, known from places like Ibero-Romance and Albanian, have been described from any western Fi. dialect.

[1] Reminder for the map reader: Kettunen’s ð is the usual Finno-Ugric transcription for the alveolar tap = IPA /ɾ/, while δ is the voiced dental spirant = IPA /ð/.
[2] Reminder thanks to my wife Sara Carrier-Bordeleau.

Posted in Methodology

Revisiting Setälä’s *pk

In 1907, E. N. Setälä published one of his last comparative linguistic works: [1] “Finnisch-ugrisches pk (~ βk)” (in FUF 6; nominally dated to 1906), on a minor addition to the cluster canon of Proto-Finno-Ugric. This was a follow-up to some discussion in early 1907 in Virittäjä by Paasonen and Setälä. [2] The idea has since then gone without much attention, either for or against. At least one of the proposed comparisons, supported also by Paasonen — Finnic *tukka ‘hair’ ~ Mari tupka, təpka (*tŭpka) ‘tuft, bunch’ — survives as late as Collinder’s Fenno-Ugric Vocabulary (1955, 1977: 63) and Comparative Grammar (1960: 87–88). Even this last case is, however, quietly dropped in later references, I think starting with Suomen kielen etymologinen sanakirja (tukka in vol. 5 from 1975) and absent also in the UEW. A look at the original work reveals that also cognates from Komi were proposed by Setälä: tup-jura ‘tuft-haired’, tup-jur ‘owl’ (= “tup-head”), tupka ‘owl’. Its removal does make sense, as by Collinder’s time it was already known that Komi /u/ normally does not correspond with Finnic *u ~ Mari *ŭ < PU *u (we would rather expect /ɨ/).

A cluster occurring only in one word could be surely deemed fairly uncertain, and other etymological directions seem to exist also for both the Finnic and Mari words. Before looking more into these though: what of the rest of Setälä’s data? He presents no less than 9 examples in his articles, which would be already more than there are examples of some regardless generally accepted PU clusters. I summarize the data below in a table (reordered, glossing simplified somewhat, some variant forms omitted):

| gloss | Finnic | Samic | Mordv. | Mari | Permic | Hung. |
|---|---|---|---|---|---|---|
| ‘to beat, chop’ | *hakkat- | *cōvkkē- | *čaka-, *čuka- | | *ćapkɨ- | csap- |
| ‘(to) kiss’ | Fi. suukko, SE tsiuku | N cuvkit ‘to smack’ | | | *ćup, *ćupked- | csók |
| | *hukka ‘loss’ | (S–N *hāvkkë- ‘to suffocate’) | | | K. /šupkɨ-/ ‘to throw’ | |
| ‘to block’ | *tukkë- | Lu–I *tëvkkë- | | | K. /tupkɨ-/ | |
| ‘to drip’ | Fi. tiukku- | | | | Ud. /ťopkal-/, /ťopkat-/ | |
| ‘to beat’ (of heart) | Fi. tykki- | | Er. tykno- | | K. /ťopkɨ-/ | |
| | *öökkät- ‘to vomit’ | | | | K. /ɨpkɨ-/ ‘to sigh’ | |
| ‘hair, tuft’ | *tukka | | | *tŭpka | K. /tup/ | |
| | *kokka ‘hoe’ | | | kopka ‘plough’ | | |

The consonant center representation indeed looks fairly regular, especially Finnic *kk and Permic medial *-pk-. Reflexes elsewhere are more scanty, and in particular no Ob-Ugric data appears at all. Unfortunately, even besides this we have several reasons right off the cuff to suspect that these are not reliable etymologies.

  • No regular reflex is established for Hungarian. We have instead one case of p, one of k.
  • An abundance of onomatopoeia / ideophones or at least meanings susceptible to this kind of origin: ‘beat’, ‘kiss’, ‘drip’, ‘vomit’. Many would have parallel variants, e.g. Fi. sykkiä besides tykkiä.
  • Poor within-branch distribution is common: we have just Finnish in two of the Finnic cases, just Northern Sami in one of the Samic cases, and just Komi in five and just Udmurt in one of the Permic cases. Some could be supplemented by newer data though, e.g. Moksha does seem to have /təkna-/ ‘to beat (heart)’, and Komi /ťopkɨ-/ ‘to drip’.
  • Some lax semantics. In the 3rd I have no idea what the basis for the comparison between Finnic and Komi is supposed to be. The ninth is not very promising at all either: a bit off to begin with, and per the more detailed data of Moisio & Saarinen, kopka does not mean simply ‘plough’, but rather ‘flat center part of the plough, where the ploughshares are attached’, rather further away from ‘hoe’.
  • Onset mismatches; at least S. *c- and Mo. *č- (suggests PU *č) versus Permic *ć- and Hu. cs- (suggests PU *ć) in the first and second; Fi. t- ~ Permic /ť-/ in the fifth and sixth. General Samic /h-/ in the 3rd is also strictly a loanword consonant, and Setälä does proceed to propose borrowing from Finnic, but still before his proposed assimilation *pk > *kk.
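The distribution counts in the third point can be checked mechanically. The sketch below re-encodes only the branch and variety labels of the nine comparisons in the table; the encoding itself (with "*" for a form given for the branch as a whole, and a range such as S–N represented by its two endpoints) is my own assumption for illustration. Hungarian is left out, being a single language. The function then tallies the cells that rest on a single named variety.

```python
from collections import Counter

# One dict per comparison, in table order: branch -> variety labels.
# "*" marks a form given for the branch as a whole; a range such as
# S–N or Lu–I is encoded by its two endpoints. Illustrative encoding only.
ROWS = [
    {"Finnic": ["*"], "Samic": ["*"], "Mordvinic": ["*"], "Permic": ["*"]},
    {"Finnic": ["Fi", "SE"], "Samic": ["N"], "Permic": ["*"]},
    {"Finnic": ["*"], "Samic": ["S", "N"], "Permic": ["K"]},
    {"Finnic": ["*"], "Samic": ["Lu", "I"], "Permic": ["K"]},
    {"Finnic": ["Fi"], "Permic": ["Ud"]},
    {"Finnic": ["Fi"], "Mordvinic": ["Er"], "Permic": ["K"]},
    {"Finnic": ["*"], "Permic": ["K"]},
    {"Finnic": ["*"], "Mari": ["*"], "Permic": ["K"]},
    {"Finnic": ["*"], "Mari": ["*"]},
]


def single_variety_counts(rows):
    """Count cells attested in exactly one named variety, per branch."""
    counts = Counter()
    for row in rows:
        for branch, varieties in row.items():
            if len(varieties) == 1 and varieties[0] != "*":
                counts[(branch, varieties[0])] += 1
    return counts


counts = single_variety_counts(ROWS)
for (branch, variety), n in sorted(counts.items()):
    print(branch, variety, n)
```

The tally also flags the lone Erzya form, which is the one that the Moksha data just mentioned would supplement.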

More trouble still comes indirectly from finer details. For one, Permic morphophonology: though consonant clusters often simplify to their first member word-finally, there are no known cases with an alternation /-p/ : /-pk-/ as would be predicted to exist from this (including not in *ćup ‘kiss’). For two, the Mari stem structure CVCCA looks suspicious: 2nd syllable vowels are usually lost in nouns, and even when they do survive, it’s usually as Proto-Mari *-ə, not as *-a. Same in a few cases in Finnic: overheavy syllable structures like *suukko, *tiukku-, *öökkä-t- are not typical for native vocabulary. And even Mordvinic: unpalatalized /t/ + front vowel /i/ (with an allophone [ɨ] that Setälä notates as y) has no known native origin. So altogether a bunch of this data does not even look native.

Even after these observations, some basis for *pk could be perhaps still salvaged. But the death knell I think is the near-complete absence of any regularity in first-syllable vowels. Only one of the eight load-bearing Finnic / Permic comparisons has good parallels: uu ~ *u, regular from PU *ow. A few cases of y ~ K. /o/ are known too, but these have a conditional explanation: assimilation *e-ü > *ü-ü early on in Proto-Finnic. [3] Many other correspondences like Finnic *u ~ Samic *ā are also firmly irregular. Hence I will be happy to think that, yes, this article and its etymologies were in error and no cluster **pk is to be reconstructed for P(F)U.

What then of the existence of a cluster /pk/ in Mari and Permic? I think this is explainable, just by morphology rather than phonology. In Mari these should be clearly considered derivatives *tŭp-ka, *top-ka, with a reflex of the common PU diminutive suffix *-kka. This source of /pk/ is already clearly evident in other cases, e.g. lap : lapka ‘low(-lying)’, šapə : šapka ‘faded’. In Permic, then, note first that /pk/ is primarily attested in verbs. I would similarly segment here roots ending in /-p/, plus the PU momentane suffix *-kə-. This is not generally productive in Permic, but traces of it have already been identified in various cases (UEW even derives *ćapkɨ- from its *ćappɜ- ‘to hit’… *a > *a remains irregular though). This seems clear at least for ‘to drip’, where even /ťop/ ‘drop’ has been attested. The involved word roots, as mentioned above, do seem to be largely simply onomatopoetic.
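The segmentation test implied here is simple enough to sketch. The function below (`segment_pk` is a hypothetical helper, and the small root list just repeats forms cited above) splits a stem at /pk/ and accepts the split only if the remaining root is attested on its own, optionally with a final -ə as in šapə.

```python
# Roots cited in the text as attested on their own (Mari lap, šapə; Komi ťop).
ATTESTED_ROOTS = {"lap", "šapə", "ťop"}


def segment_pk(stem, attested_roots=ATTESTED_ROOTS):
    """Split a stem containing /pk/ into (root, suffix) if the root is attested.

    Returns None when there is no /pk/, or when the root is not in the list.
    """
    cut = stem.find("pk")
    if cut == -1:
        return None
    root, suffix = stem[: cut + 1], stem[cut + 1 :]
    # Allow for a root-final -ə that is missing before the suffix
    # (cf. šapə : šapka).
    for candidate in (root, root + "ə"):
        if candidate in attested_roots:
            return candidate, suffix
    return None


for stem in ["lapka", "šapka", "ťopkɨ-", "ɨpkɨ-"]:
    print(stem, segment_pk(stem))
```

A result of None here only means that the root is missing from this short list, not that no such root exists; extending the list from dictionary data would be the actual work.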

One more Uralic variety is also known to have /pk/: Southern Sami, among this data only in hapkedh ‘to choke’, but other cases exist too. In Setälä’s view, /pk/ ~ /vhk/ (< *vkk) would be different generalized grades of an old alternation pattern *pk ~ *βk, but no direct evidence whatsoever exists of such an alternation. I wonder if a phonological solution could be still sought: *vkk > /pk/ might be an old regular sound change in SS.

One clear loanword, SS haapkie ‘hawk’, is alas not evidence for such a change. Other Samic reflexes like Lule hábak point to loaning already from Proto-Scandinavian *habukaz (→ PS *hāpëkkē + later syncope in SS), [4] not from attested Old Norse haukr (which could have yielded PS **hāvkkē). However, I suppose that also western Samic *hāvkkë- is still a loanword from Finnic; the source is just not Setälä’s *hukku- ‘to disappear, drown’, but rather *haukki- ‘to gasp for breath’ (+ other meanings). This has been attested from most of Eastern Finnic, e.g. Karelian haukkie, also dialectal Finnish haukkia; its standard Finnish variant haukkoa seems to be actually more narrowly distributed altogether (and thus younger?).

Still on the other hand, a development *pk > *vkk or maybe straight to /vhk/ would make more sense within the general dialectology of Samic, where we also have innovations like *šk > /jhk/ across all western varieties. [5] A likely intermediate looks to be *fk, which is actually the normal Kola Sami reflex of *vkk. Lehtiranta’s Proto-Samic reconstruction already takes this stance, giving PS *cōpkë- for a similar correspondence in SS tsuopkenidh ‘to break (intr.)’ ~ other Samic *-vkk-, e.g. NS cuovkut ‘to break (tr.)’. This would also allow supposing that attested cases of /vk/ in Southern Sami do continue PS *vkk and are not newer loans from other Sami varieties: from Lehtiranta we have jaavk-udh ‘to appear’ ← *jāvkkë- ‘to disappear’, raavkedh < *rāvkkë- ‘to demand (back)’. But then we face again the question of explaining the origin of SS /pk/. For *cōpkë- ‘to break’, the same approach as in Permic is perhaps not impossible: is it also a relict *-kə-momentane from an onomatopoetic root *cōp- < *čap(ə)-? Alas for hapkedh this will not readily work. Still for that matter it also shows short /a/ which does not match the cognates I’ve proposed to reflect Finnic ⁽*⁾haukki-… Extremely speculatively I could entertain an idea of this to be instead from a PS *θëp(pë)-, as a cognate of Finnish–Karelian *tüppe-htü- ‘to be extinguished, out of breath’ [6] that has indeed been amended with *-kə-; i.e. pseudo-PU *ďüppə-kə-?! Usually though this Finnic verb has been considered a parallel derivative to *tüpp-i- ‘to block, close’ which furthermore also has a known Samic cognate *tëppë- ‘id.’ > SS dahpedh (with /t-/, not /h-/ < *θ- < *ď-). I am not sure if the existence of words like Es. läppama ‘to choke’, NS lahppasit ‘to be out of breath’, Mordv. *ľäpija- ‘to choke’ is worth anything: they don’t correspond well with each other, but they could suggest an old ideophone of lateral + *pp for ‘to choke’, and my *ďüppə- could also fit under this pattern if *ď- had been originally lateral. But for now this is at best a stretch.


This post was originally inspired by some observations on a possible different etymological origin of one of the involved words… it would be, by now, however an entirely different tangent, and I may return to that topic instead later.

[0] In case this post seems like an excessive amount of effort to spend on forgotten crappy etymologies from 115 years ago, cf. further my older discussion of “anti-etymologies”. It is very possible that the poorness of these comparisons would not be apparent to some people happening upon Setälä’s work! They also have given me an opportunity to talk a little about some other topics that have been on my mind, such as /-kɨ-/ as a Permic verb suffix.
[1] In his later years he would be much more involved instead in the politics of newly independent Finland.
[2] “Alkuperäisestä -pk-sta on suomessa tullut -kk-”; “Alkuperäistä -pk-ta ja sen heikkoa astetta edustaa suomessa -kk- ja -uk-.” (Neither currently available online, but perhaps in the future.) Setälä’s article also reports that the editing of FUF 6 had been finished, but the issue had not yet gone to print, by the time Vir. 11/1 appeared in late February 1907.
— TBH, to me it would seem like an amazing coincidence that both scholars had been planning to publish on the same minor sound change at almost exactly the same time. Since in our time it is known that Setälä in his later years had a track record of stealing discoveries from other scholars, I do have to wonder if he is here too trying to claim priority from Paasonen on the three comparisons he advances (those of tukka, tukkia, kokka) by sneaking a small article in at the last minute into his own journal. He did clearly come up with the idea of the correspondence F *-kk- ~ P *-pk- though: the comparison of suukko ~ cuvkit ~ K. ćupköd- appears already in his 1896 article on consonant gradation (SUSA 14).
[3] PF *lülü ~ K. /lol-/ ‘hard heartwood’; PF *süntü- ‘to be born’ ~ K. /sod-/ ‘to multiply’; PF *süttü- ‘to be ignited’ ~ PP *sɔtɨ- ‘to burn’; see recently Aikio (2021). To be fair, in the cluster of tykki- we do have Fi. tykyttää with cognates also in Karelian and Ludian; but also a morphologically primary-looking variant tykkä- is attested, stretching wider still to Ingrian, Veps and Estonian.
[4] Perhaps also not directly from Scandinavian, but thru Finnic *habukka (> standard & western Fi. haukka but e.g. eastern Fi.–Krl. havukka, Lu.–Veps habuk).
[5] Traditionally considered the defining innovation of a Western Samic subgroup, but I would agree more with a division into South–Ume versus Rest being older, as argued in recent times (future blog post on this perhaps coming).
[6] Inspiring also modern Finnish typpi ‘nitrogen’ as a back-derived coinage.

Posted in Commentary, Reconstruction