(Part ca. 3 of n in my irregularly scheduled series of Introducing Named Soundlaws in Uralic Studies. [0])
The issue, as I see it
Most of the vowel correspondences we now think to be regular between Samoyedic and the rest of Uralic are those that were outlined by Janhunen in 1981. The actual sound laws behind them have regardless often gotten re-tooled or re-dated by now, much in the same way as many of them already had earlier precedents in some form (primarily from Lehtisalo or Steinitz). E.g. the chainshift *e > *i, *ä > e has been by now shown by Helimski to be post-Proto-Samoyedic, given Nganasan evidence for *e > †e > i̮. On follow-up, also the reflexes of *ä > “*e” can be relatively open in some languages: Salminen (2012) has pointed this out about modern Forest Enets (e.g. *tät³tə > tät ‘4’), and to me it seems that e.g. the conditional developments *ä-a, *ä-å > *a in pre-Selkup also presume an open value for *ä. Cf. *ān-uj ‘true’ < PS *änå, or *kuəsə ‘iron’ < *wåsV < *wasV < PS *wäsa.
What I call “Janhunen’s Law” is, though, not any sound change in Samoyedic, but a proposal that he had in the same paper for an innovation in some uncertain amount of western branches: PU *oCə > *uCə. Sammallahti (1988) indeed adopted it as an already Proto-Finno-Ugric innovation. Since then, though, there does not seem to have been much support for it; but then neither has there been critique or any other analysis.
On any kind of closer look, it does seem clear this cannot be quite as simple as Janhunen suggests. First of all, also a correspondence western *o ~ PS *o exists. Janhunen identifies two examples: *koj(-wV) ~ *koəj ‘birch’, *kopa ~ *kopå ‘bark’. This number can be increased: clear examples also include *koj(ə)ra ~ *korå ‘male animal’; *kokə- ~ *ko- ‘to check, see’ (all of these with *ko-, but this looks simply accidental; *ko- > *kå- can be also attested in e.g. *kåmpå ‘wave’, *kåsə- ‘to dry’, *kåət ‘spruce’). Possibly also *ńoxə- ~ *ńo- ‘to pursue, hunt’, though Janhunen assumes that Finnic *nouta- continues earlier *ńux-ta-, thru a similar lowering as in *sou-ta ‘to row’ ~ PS *tu- < PU *suxə-, and this does not look entirely impossible.
I’ve observed already long ago (first presented at the 2nd International Winter School of FU Studies in Szeged in 2014) that there seems to be evidence for further conditioning. First, all of Janhunen’s positive examples involve front consonants in the medial consonantism: alveolars and labials. Four cases are immediately unambiguous:
- *lumə ~ *jom ‘snow’;
- *kusə- ~ *kot- ‘to cough’;
- *purə- ~ *por- ‘to bite’;
- *tulə- ~ *toj- ‘to come’.
I would add first of all two cases that should be reconstructed with *-w- and not, as proposed by Janhunen, *-x-:
- *śuwə ~ *śo(-j) ‘mouth, throat’; *-w- is clearly indicated by Southern Sami tjovve.
- *tuwə ~ *to ‘lake’; *u reflected at least in Permic *ti̮. Original *-w- seems to be indicated by Northern Khanty *tŭw, Konda tŏw, and maybe the oddly front-vocalic təw in the rest of Southern Khanty. [1]
Probably even a third is *luwə ~ *lë ‘bone’. *-w- is again indicated by Western Khanty forms — mostly rhyming with ‘lake’, e.g. Konda tŏw, other Southern təw, Nizyam tŭw, Kazym ɬŭw (but in Obdorsk lăw, versus tuw ‘lake’). Samoyedic *ë could indicate a shift *ëw > *ow in other languages already before *o-ə > *u-ə (a tentative Proto-Finno-Ugric innovation — though this seems a bit too trivial and devoid of parallels to be relied on for that).
One additional example that was not known to Janhunen shows a palatalized alveolar medial: *wuďə ‘new’ ~ *oj- > North Selkup oć-əŋ ‘again’, a neglected etymology from Helimski (1976). [2] Note further that positing *o > *u here explains the rare initial combination *wu-, not reconstructed anywhere else in Uralic vocabulary and probably phonotactically impossible in Proto-Uralic proper.
Looking beyond Samoyedic, it also seems to be the case that from the evidence of other languages, we cannot really reconstruct word roots of shapes like *CoPə, *CoTə, *CoRə. The best two contenders are *monə ‘many’, *wolə- ‘to be’, but the first is readily under doubt as being a loan from Indo-European (also Permic *-mi̮n, Mansi *-mān, Hungarian -vAn in names of decads does not particularly have to be related to ‘many’ in Finnic and Samic), and the latter looks more likely to have been *walə-. On the contrary, many reconstructions of the shape *CoKə have been already presented: at least *jokə ‘river’, *rokə- ‘to hack, cut’, *soŋə- ‘to enter’, *šokə- ‘to say’, *toxə- ‘to bring’; maybe also e.g. *poŋə ‘bosom’, *oŋə ‘hole’ (if not rather *poŋŋə, *aŋə). I take this also as grounds to suppose that there has indeed been a sound change *-oCə > *-uCə, for C ≠ velar.
I suspect also palatal *-j- might have blocked raising: cf. *kojə ‘male’ (though this is mostly continued in derivatives like *koj-ma, *koj-ra). An interesting case on this front is ‘to swim’, usually reconstructed as *ujə- per Finnic (Finnish uida, Estonian ujuma etc.), but most cognates (clearly at least Samic *vōjë-, Mordvinic *uj-, Permic *uji̮-, SKhanty üj-) better point to *ojə-. As I’ve noted by now in a talk from 2018, even within Finnic, Livonian vȯigõ (? < *oi-kV-) seems to still retain *o. The reflex in Samoyedic, on the other hand, mysteriously enough, is still indeed *u- or *uj-.
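The conditioning argued for above can be written out as a small rule. This is only a sketch under simplifying assumptions of mine: stems are plain strings without the asterisk, they have a single medial consonant, the velar set contains just the symbols used in this post, and the function name and the block_j switch (for the suspected blocking by *-j-) are hypothetical.

```python
# Sketch of the conditioned raising *oCə > *uCə discussed above.
# Assumptions (not from any published formalization): stems are written
# without the asterisk, have exactly one medial consonant, and the velar
# set below lists only the symbols used in this post.

VELARS = {"k", "ŋ", "x"}


def janhunen_raise(stem: str, block_j: bool = False) -> str:
    """Raise o > u in stems ending in oCə, unless the medial C blocks it."""
    form = stem.rstrip("-")      # drop the verb-stem hyphen, if any
    hyphen = stem[len(form):]    # remember it so it can be restored
    # The stem must end in: o + one consonant + ə
    if len(form) < 3 or form[-1] != "ə" or form[-3] != "o":
        return stem
    medial = form[-2]
    if medial in VELARS:
        return stem              # *CoKə stems keep *o
    if block_j and medial == "j":
        return stem              # optional: suspected blocking by *-j-
    return form[:-3] + "u" + form[-2:] + hyphen
```

With these assumptions, a hypothetical input "lomə" comes out as "lumə" and "porə-" as "purə-", while "jokə" and "soŋə-" are left alone. Stems in *-a such as "kopa" are not touched either, since the rule only looks at ə-stems.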
An alternative view?
The only counterproposal in any clear detail that I’ve seen comes from Jaakko Häkkinen, first in his Master’s thesis and later, much more briefly, in his 2009 paper on locating Proto-Uralic. He suggests inverting Janhunen’s Law, to apply in Samoyedic and not outside of it: *CuCə > *Co(C). I have seen / heard something similar by other colleagues in a variety of discussions, but I do not recall any defense of this being published. At most, see some discussion in this blog’s comments starting here, with Ante Aikio listing some notes about *o ~ *u variation within Samoyedic and additional irregular-looking examples of *o. Among these I would doubt at least the reconstruction PS *počå- ‘soak, ooze’, though. This probably refers to the words appearing in UEW under *poča- ‘become wet’; but Nganasan and (with irregular b-) Kamassian seem to point rather to *påTå-, with evidence for *o limited to Nenets–Enets. Or, since (old) Nganasan fo- can continue not just *på- but also *pə-, and Enets has o < *ə regularly, another option, maybe better still, would be that this was *pəčå- in PS after all, as would be expected per the Udmurt, Khanty and Mansi cognates; and that the Nenets word is a loan from Enets, while the Kamassian word doesn’t belong here at all. (Donner’s original data actually has not just a voiced b but palatalized bʲ, which is also difficult to explain.) In some other examples I don’t see any particular reason to think that they point to secondary *u > *o rather than secondary *o > *u (thus so maybe in “*num” ‘heaven’) or to *o at all (thus so in Nganasan tui ‘fire’ for expected ˣtüi: this looks like unclear retention of *u, which has other parallels).
Anyway, the major problem that I see in the inverted approach is explaining where Proto-Samoyedic *Cu(C) then comes from. There is solid evidence at least for a rime *-uj:
- *tuj ‘fire’ < PU *tulə (a minimal pair with *toj- ‘to come’!);
- *uj ‘pole’ < PU *ul(k)ə;
- *kuj ‘spoon’ < PU ? *kujə (cf. Finnish kuiri ~ kuiru ‘id.’; I am not committed either way on if proposed Komi and Ob-Ugric cognates meaning ‘trough ~ mortar’ belong);
- *puj ‘eye of a needle, etc.’ < *pujə.
The last two probably show PU *-jə > ∅ and PS *j as some derivative suffix, [3] but this alone cannot explain *u rather than *o, since also the latter readily occurs in CV stems: *ko-, *ńo-, *to, *śo-j. A few PS roots also show *u: natively at least *tu- ‘to row’ < PU *suxə; of unknown origin, *ku- ‘cord’, *ju ‘warm’ [4]. Some other CVC examples can be found too, including *pur ‘smoke’ < PU *purkə; *ut ‘road’ < PU ? *uktə. But at least these two examples we might argue to be irrelevant due to continuing PU *u in an original closed syllable, just with exceptional loss of *-ə after some probably very early cluster simplifications.
As comes to the lack of PS roots of shapes such as **Cup, **Cun, **Cuŋ, this could indicate that something happened to such cases, but it doesn’t follow that the result must have been *o. Other options would readily include reduction to *ə, already suggested by Janhunen in e.g. *təŋ ‘summer’ < PU *suŋə.
Future hypotheses
So far I do side with the hypothesis that Janhunen’s Law is a real phenomenon. Its exact extent and conditions seem to require review, however. I have some reasons to suspect that PU *o was in *CoCə stems retained not just in Samoyedic, but partly also elsewhere. E.g. *purə- / *porə- ‘to bite’ yields in Permic *puri̮-; *tulə- / *tolə- ‘to come’ yields in Mari *tola-; both more in line with development from *o than *u. An interesting recent discovery, premiered a few weeks ago on Twitter, has also been to note Khanty *lāńć ‘snow’ (> e.g. Surgut ɬ´åńť, Nizyam tɔńś, Obdorsk laś). UEW derives this from a distinct *ľomćɜ, listing here also some derivatives of PS *jom and probably incorrect Kola Sami reflexes meaning ‘frost’. But if we did reconstruct *lomə and not *lumə already in PU, the Khanty words, too, can be simply considered derived reflexes, at the PU level seemingly *lom-ća: *o-a > *ā is regular, and there does not seem to be counterevidence to assuming *mć > *ńć. Closer review might identify more cases like these that support the reconstruction of PU *o in the involved words.
As more of a long shot, there are also two unclear cases where evidence for *o might be found in Indo-European. For one, ‘to bite’ seems comparable with PIE *bʰe/orH-, root meaning probably ‘to strike, pierce’. The PU verb also probably meant specifically ‘bite thru’ (in contrast to *soskə- ‘to chew’), coming fairly close to ‘pierce’. Its descendants can be also used not of just biting with teeth, but also working with tools (cf. e.g. Fi. sahanpuru ‘sawdust’, as if “saw-biting”); similar later development is attested in derivatives on the IE side too (Latin forō, Germanic *burō- ‘to bore, drill’) [5] and LIV goes as far as to give a gloss ‘mit scharfem Werkzeug bearbeiten’. Distribution all the way into Samoyedic makes it difficult to assume loaning, though, while a hypothesis about an old Indo-Uralic cognate would not, at the current state of research, rule out an original *u that was lowered to ablauting *e/o in PIE. For two, there is Finno-Mordvinic *unə ‘sleep’, which Koivulehto (1991) has already compared with Greek ὄναρ, ὄνειρο- and explained exactly thru Janhunen’s Law: early IE *oner → early Uralic *onə > *unə. Whether the Greek word goes back far enough in IE for this to be feasible looks very dubious to me though, especially when there is a much better-attested PIE word for ‘sleep’, *swépnos.
A yet further possibility I would wish to look into in more detail in the future is, does the raising of *o that we seem to see really have the “same” *o as its starting point as is usually reconstructed in PU? Namely, traditional PU *o is in Samoyedic by default lowered to *å, such that its “survival” in Janhunen’s Law cases really looks to be innovative as well. As outlined in yet another presentation a few years ago, I have also developed a hypothesis that the unbalanced inventory of rounded vowels in Proto-Uralic: *ü *u *o but no **ö, probably comes by a chainshift from pre-PU *u *o *ɔ. (I have not discussed this on the blog in detail so far and, alas, cannot do so right now either.) Then, the common tendency of PU *o to be lowered to *a / *å probably indicates that this chainshift had actually not fully taken place by PU: that “*o” was really still open-mid *ɔ. Janhunen’s Law positions, however, look like they might have already had close-mid *o. This would allow us to do away with a raising that happened all across “Finno-Ugric” with seemingly no motivation, while still also not folding the vowel correspondence entirely into PU *u.
There would be also another option on the relationship of this *o with my pre-PU *u *o *ɔ. Rather than early raised cases of (pre-)PU *ɔ, they might be also straggling non-raised cases of pre-PU *o… And then was this *o really just an allophone of *ɔ either? *u is a very common vowel in PU, and perhaps this is partly because even some further cases should be likewise reconstructed as *o. This might be possible if we identified other evidence for it than retention as *o in Samoyedic. For the sake of example, one case might be Mansi *u: PU *u yields in Proto-Mansi either *u, *ŏ, *ă with no very strong conditioning apparent. (Some similarly open issues remain in Khanty and Hungarian.) So just maybe … could it be that PMs *u is a sign of PU *o as distinct from both *u and *ɔ in general? such that not only will we then reconstruct PU *por- ‘to bite’ (> PMs *pur-), but also e.g. *końćə ‘urine’ (> PMs *kuńćə), with *o > *u now also in Samoyedic in this environment (> PS *kunsə)? This would even have a good parallel among the front vowels: PMs *i is generally from PU (close-)mid *e, not from close *i. — But in the interests of putting these notes finally out at least in a somewhat assembled form, I will leave this line of thought open for now.
[0] See previously at least: Lehtinen’s Law; Moosberg’s Law; and one that definitely requires a name but I’m still mulling over what to call it precisely is *Ä-backing in Finnic. Several future installments remain planned too.
[1] On the contrary, an irregular fronting already in Proto-Western Khanty would also account for most of these reflexes: *tŭɣ > *tü̆ɣ > *tĭɣʷ > *təw, preserved in SKh and giving NKh *tŭw (cf. e.g. ‘fall’: PKh *sü̆ɣəs ~ *sü̆ɣs > SKh səwəs ~ süs, NKh *sŭws or *sūs). But it seems preferable to me to restrict this irregularity to Southern Khanty and treat Konda tŏw and NKh *tŭw as regular reflexes. Maybe there is some possibility that the SKh development here and in ‘bone’ can be explained as *ŭw > *ū > *ǖ > *ü̆w > əw, leveraging the known fronting *ū > *ǖ? It doesn’t look like *ŭw and *ū actually contrast at all, so the first step here might be entirely virtual.
[2] Хелимский, Е. А.: О соответствиях уральских a- и e-основ в тазовском диалекте селькупского языка. – Советскoе финно-угроведение 12: 113–132. No cognates known elsewhere in Samoyedic, but the simplification *wo- > *o- would have to be pre-PS anyway, since by PS a new *wo- does exist and per two examples yields in Selkup *ko- as expected: *woəj > *ko ‘island, hill’; *wotå > *kotə ‘blueberry’.
[3] Though, since PS shows *r > *l / C_ in various suffixes, could it be possible that after *j, the resulting cluster further coalesced to *ľ, and then evolved into just *j as usual? In this case Fi. kuiri and PS *kuj could both go back to PU *kujrə (now with no especial reason to suspect a suffix in there).
[4] For a formal match and semantics within speculation distance, cf. PU *luwə ‘south’ ≈ ‘direction where the weather is warm’?? Seems unlikely but not impossible.
[5] And cf. further PU *pura ‘drill’, also already proposed to be an IE loan. So far it seems morphologically unclear to me how to connect this with either the PU or PIE verbs, though.
Reviewing UraLex
Nerdsnipe of the day: the BEDLAN team, researching diversification of the Uralic languages interdisciplinarily, mentioned earlier today that they will be soon uploading version 3 of their UraLex dataset of basic vocabulary across Uralic. I thought this might be a good time to do a look-over of the data, from a not-that-computational historical linguist’s point of view (i.e. mostly on the contents, not the technical details). Maybe these comments will be helpful either to the team or to other people aiming at similar projects.
Data sources
The selection / definition of languages looks mostly good already to me, with varieties being specified fairly closely, including details like “Sosva Mansi” rather than just “Northern Mansi”. Unmarked “Selkup” is however questionable at least. This is claimed in the documentation to be more specifically Taz Northern Selkup, the currently most vital dialect [1] and the basis of current written Selkup. The listed forms, though, often look more like the Proto-Selkup reconstructions from Sölkupisches Wörterbuch, e.g. in retaining PSk *č (> modern NSk /t/) and *uə (> *Cʷë > modern NSk /Cɤ/, /wɤ/). A similar issue is the database’s “Karelian Proper”. This too does not appear to be any real variety of Karelian, but rather the interdialectal lemma forms of Karjalan kielen sanakirja, which are frankly overly Finnishized (not really actual Proto-Karelian), and elide many important contrasts, especially voiced obstruents and, mostly, the s / š contrast. E.g. rasva for ‘fat’ only appears as such in the Oulanka dialect. Most northern Karelian has rašva, much of southern Karelian razva, some intermediate southern dialects ražva.
The KKS and SkWb lemmas are probably tolerable as lexicostatistic indices to Karelian and Selkup, but I hope some future update might fix this in favor of actually-recorded language varieties — and certainly before anyone tries to do phonological analysis with this data!
I would have some desiderata myself on what varieties’ classification would be interesting to gauge by their lexicon. Foremost maybe transitional varieties, such as Karelian Isthmus Finnish; NE Erzya and Shoksha; Pelym, Lozva & Eastern Mansi; Berezovo, Nizyam, Salym & Vartovskoe Khanty; anything really among the Selkup dialects. But it’s possible that this is too fine detail for a Uralic-wide dataset and would call for within-language-group studies instead, similar to Rydving (2013) on Sami. And it appears that the most important additions for within-Uralic study have already been planned: adding Moksha besides the currently represented Erzya; Hill Mari besides Meadow Mari; Obdorsk (Northern) Khanty and Pelym (Western) Mansi varieties besides EKh and NMs; Kamassian and Mator within Samoyedic. These should cover many bases. E.g. the well-known Mansi cognate(s) of Hung. tűz, EKh tö̆ɣət ‘fire’ are not recorded from NMs, but do appear in WMs (Pelym toåwt, Upper Lozva töät, North Vagilsk tüöwt, etc.)
A different point entirely is that attempts to study specifically the interrelationships of the nine basic Uralic branches would, I think, function the best if using their protolanguages as the basic data points. There are also a few gotcha cases where no coverage of modern-day languages is sufficient: occasional native Uralic terms might be reconstructible for Proto-Mansi only from early 19th century wordlists, for Proto-Samoyedic only from Castrén’s mid-19th century records, for Proto-Mordvinic only from Witsen’s 18th century records, for Proto-Hungarian only from early medieval records, etc. Comparative-historical Uralistics is maybe not particularly philology-centered, but has never been able to afford overlooking philology entirely. [2]
The selection of semantic concepts to cover is generally reasonable, pulled from major basic vocabulary lists like various Swadesh lists and the Leipzig-Jakarta list. Some of the items on these do break up completely to noise within Uralic, but that’s a good point to have on record as well. I do not think the classic Swadesh list was assembled very rigorously, and at some point it would be good to know not just something about the relative average stability of concepts on it, but also their variance in stability across different language families. An example I have often mentioned in discussions related to this is how in Uralic, ‘fish’ and ‘moon’ are highly stable, while ‘cow’ is unreconstructible and ‘sun’ is highly unstable; while in Indo-European, ‘cow’ and ‘sun’ are highly stable, vs. ‘fish’ unstable and ‘moon’ just about unreconstructible. (This phenomenon e.g. already constitutes a fairly strong critique of glottochronology or any models resembling it, which would rather predict average variance to be a monotonic function of average stability.) Many of the more unstable and entirely unreconstructible concepts seem to be from the LJ list. This is basically what we should expect I think, since these have been selected only by their stability vs. loaning, not vs. all the other lexical innovation processes out there like derivation, semantic shifts, onomatopoeia, a priori coinages (and also not even vs. the likelihood of synchronic synonymy).
There are regardless still many semantic concepts or etymological groups that I think would have a bunch to say about the diversification of Uralic, but which haven’t made the mark. These are I suspect typically more Uralic-specific, and they could not be easily located by general cross-linguistic considerations. Simple examples include e.g. terms for local fauna (*śixələ ‘hedgehog’, *onča ‘nelma, Stenodus’), flora (*ďëmə ‘bird cherry’, *pečä ‘pine’) and technology (*joŋsə ‘bow’, *ńëlə ‘arrow’). More involved examples tend towards etyma that Helimski (2001) has called core vocabulary as distinct from basic vocabulary: often verb roots, relational terms, or incipiently grammaticalizing body part terms, that may not have strong semantic stability but do have decent etymological stability. In Uralic thus e.g. *kixə- ‘to rut, lek, be excited, lustful, want’, *kulə- ‘to go out, run out, wear, end’; *pučkə ‘hollow, tube, inside, marrow’; *pončə ‘tail, hem, back part’ (glosses not meant as PU but indicating the range of variation in reflexes). Most regular lexicostatistic methods run poorly however if matched against etyma that don’t have stable or well-defined proto-meanings, e.g. we can’t really ask what is “the” replacement of such an item in a language that has lost it. Down the line, some new techniques entirely will be required for making use of this kind of data instead.
Phonetics & Phonology
I do not know what use, if any, is planned for this part of the data, but especially inconsistent IPA transcription seems to remain a major problem, as many other times in Uralic studies.
I suspected data mix-up initially, but this clearly must be a processing problem instead, given even e.g. köχ ‘stone’: no such form appears anywhere in Khanty (it’s VVj köɣ, Jugan kä̆w, other Surgut kä̆ɣʷ, all western kew). Are these words derived from some orthographic source that spells VVj /ɣ/ as Cyrillic ‹х›, by any chance? (But still correct forms in many other cases like oɣ ‘head’, soɣ ‘worm’, wajəɣ ‘bird’.)
Looking over these issues, I could formulate a Rule #1 for IPA-fying FUT: the transcription systems do not correspond 1:1 and several details must be, alas, checked on a language-by-language basis. Especially vital is understanding your source data: whether whatever you are IPA-fying is pre-WW2 “hyperphonetic” FUT; mid-century “major-allophonic” FUT; or post-70s “phonological” FUT. IPA comes with its bracket notation [d͇], /d/, //ð// etc. to warn what level of transcription you might be dealing with… FUT does not, perhaps its biggest flaw. A related Rule #2 might be that it’s similarly important to understand what you are trying to do with IPA: phonological, broad phonetic or narrow phonetic transcription? Most of the time, there is no One Correct IPA Representation either.
In the base FUT data I do not see any further major issues. It would be probably good to make sure to distinguish ´ (the suprasegmental palatalization sign) and ˈ (the overlength / strong-grade cluster sign) in the Samic data though. Currently both seem to be much of the time encoded as a simple apostrophe; e.g. Inari Sami kyevˈđi ‘snake’, Skolt ku´vdd ‘id.’ are given as “kyev’di”, “ku’vdd”. Occasionally even opening or closing single quotes appear (thanks, Microsoft). Apostrophes do actually even triple duty in marking palatalized ľ in other languages, but this seems unlikely to do any real harm.
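A first step towards untangling the two Samic signs could be simply to locate the marks that are ambiguous. The following is a minimal sketch, not part of any UraLex tooling: the function name is hypothetical, and the set of characters is my assumption based on the cases just described (plain apostrophes plus opening and closing single quotes). Deciding which of ´ or ˈ each hit should become would still need checking against the source for each language.

```python
# Sketch: flag apostrophe-like characters that might stand for either
# ´ (suprasegmental palatalization) or ˈ (overlength / strong grade).
# The character inventory is an assumption, not taken from the dataset.

AMBIGUOUS = {
    "\u0027": "ASCII apostrophe",
    "\u2018": "opening single quote",
    "\u2019": "closing single quote",
}


def flag_apostrophes(form: str) -> list[tuple[int, str]]:
    """Return (index, description) for every ambiguous mark in a form."""
    return [(i, AMBIGUOUS[ch]) for i, ch in enumerate(form) if ch in AMBIGUOUS]
```

Forms that already use the dedicated signs (U+00B4 for ´, U+02C8 for ˈ) return an empty list, so the check can be run over the whole Samic column and only the problem cases reviewed by hand.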
Protoforms
The dataset is of course primarily about attested lexical data, so I maybe should not spend too much time on examining the proto-language reconstructions included (only Proto-Uralic, no intermediate reconstructions). Still, this is protouralic dot wordpress I am blogging at, so some observations on that topic too.
The transcription scheme seems to closely follow Janhunen 1981, Sammallahti 1988. The *i/i̮ reconstruction for noninitial syllables is used almost thruout; an *-e- has slipped in only in *koje-mV ‘husband’. *i̮ rather than *e̮ is used in initial syllables too, however still an **a in at least a few lexemes like *maksa ‘liver’, *maɣi̮ ‘earth’ (= J *mi̮kså; S *mɨkså, *mɨxi); also *ńś rather than *ńć, though a traditional *ć is still retained in some cases. Different transcription schemes are more inconsistently mixed for the “voiced spirants”, including ‹δ› in *śaδa- ‘rain’, but ‹ð› in *wuði̮- ‘new’; ‹x› in *juxi̮- ‘to drink’, but ‹ɣ› in *miɣi- ‘to give’.
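Mixed notation of this kind is easy to catch mechanically. As a minimal sketch (the group labels and the function name are hypothetical, and the symbol pairs are just the ones named above), one could count which competing symbols actually occur in the list of protoforms:

```python
from collections import Counter

# Sketch: report segments for which more than one competing symbol is used.
# The variant groups follow the examples in the text; extend as needed.

VARIANT_GROUPS = {
    "dental spirant": ("δ", "ð"),
    "velar spirant": ("x", "ɣ"),
}


def mixed_notation(protoforms):
    """Return {group: {symbol: number of forms}} for groups with 2+ symbols in use."""
    report = {}
    for label, symbols in VARIANT_GROUPS.items():
        counts = Counter()
        for form in protoforms:
            for sym in symbols:
                if sym in form:
                    counts[sym] += 1
        if len(counts) > 1:
            report[label] = dict(counts)
    return report
```

An empty result means each segment is written with a single symbol thruout; any non-empty entry points to forms that should be harmonized one way or the other.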
A possible consequence of the dataset’s original compilation for a lexicostatistic review of the traditional Uralic classification is also that some meanings are marked as “[Not reconstructible]”, although they would have well-established though western-leaning proto-forms, e.g. *külmä ‘cold’ (maybe debatable; an IMO poor loan etymology from Balt(o-Slav)ic remains marked for the reflexes), *mälə ‘mind’ (clearly PU; this is reflected in derived verbs in Ob-Ugric), *läwlə ‘heavy’ (EKh ‘cold’ probably doesn’t belong). Some items reconstructed in recent literature are missing too, e.g. Aikio’s revamped *këččə ‘bitter’, *widä- ‘to kill’. More worrying for me is how also many long-known proto-forms are left absent, such as *küsə ‘thick’, *näkə- ‘to see’ (admittedly most reflexes derivatives w/o this meaning), *lükkä- and *puskə- both ‘to push’, *śepä ‘neck’, *sańća- ‘to stand’, *wëlkə ‘white’. I don’t think this can be just due to later semantic divergence in some reflexes, when e.g. *jelä ‘day’ has been admitted as a PU form only from Samoyedic direct evidence (parallels also at minimum in Samic); and *śilä ‘fat’ from no direct evidence at all? Yet also some poor comparisons from UEW seem to remain around, e.g. “*čočV-” ‘to wipe’; actually its only reflex meaning ‘wipe’ is Finnish huosi-, which I don’t think can belong here. [3] — These types of issues may even combine for more involved cases. E.g. the PU word for ‘full’ is given as *türə, a narrowly distributed Finnic–Permic etymon, and not the better-distributed *täwdə. This is again probably per UEW, which maintains Selkup tīr as reflecting the former and not, as recognized since Aikio 2002, the latter. [4] Or, the word for ‘year’ is given as *ärV; but this reconstruction was in effect already refuted by Aikio 2012, who points out that the Samoyedic forms (meaning ‘fall’) go back to PSmy back-vocalic *ër-, which continues rather the already better-distributed PU form *ëdə. [5]
A methodological choice also seems to have been that no synonyms are admitted for PU, although there probably are a few concepts in the data for which they existed; e.g. besides *śilä for ‘fat’, we can reconstruct also *wajə, *koja (both already alluded to in the database; the former though specializing to ‘butter’ in most Uralic languages familiar with agriculture).
(All my Uralonet links above show what I think of as their most reliable reconstructions, but defending those would be at times quite a debate that I don’t intend to get into in detail here — I’ll be happy as long as the reconstruction system chosen is at least internally consistent enough.)
Since following newer literature adequately appears to have given some difficulty for the team, I would like to note here (I think for the first time on this blog) that I’ve already a few years ago started a little repository of new results in Uralic etymology, currently keeping track of
The list(s) can be found at the Sanat wiki, as a part of / appendix to our etymological database of Proto-Finnic. [6]
Currently pending updates include, besides better coverage of several earlier but post-UEW sources, especially several new native and loan etymologies for Mari and Permic from Metsäranta’s PhD thesis from 2020. I have also been thinking of starting an “antietymological” sister repository, tracking PU reconstructions that have been clearly disproven by better etymologies being published for all or all-but-one of their reflexes, of which there are quite a few by now too.
Etymological marking
Maybe the core content of the dataset. Standard literature has been followed quite faithfully here and I see no major flaws (even where etymological relationships have not been seen fit to be promoted to Proto-Uralic status). Mostly I can just point out some recent and overlooked results. Besides cases already mentioned:
I suppose this is by now enough comments for one day. I know that assembling and curating datasets this big is quite the task, and I could probably also spend a week more reading this in further detail. Hopefully I’ve already pointed out some productive directions for future improvement though. (And if you were thinking of otherwise releasing 3.0 just tomorrow: sure, don’t mind me, there will be time in the future too to improve things.)
Edit 2022-06-27: See also some brief responses from Outi Vesakoski (and further from me) at Twitter!
[1] Very relatively so: at triple rather than double or single digits of speakers.
[2] So far the biggest gap in philological coverage is probably the old Swedish “Biblical Sami” records, substantial already in the 18th century, but to my knowledge they have never been looked over in detail etymologically.
[3] Has been further etymologized as being maybe from Proto-Finnic *hosja ~ *hoosja ‘horsetail, Equisetum’ (traditionally used to make scrubs), which I don’t think has itself any etymology yet. By its phonological structure it obviously cannot be native Uralic as is. Inverting the semantic derivation though, an irregular (?) contraction from an agent noun *hosija < *hose/i-ja ‘sweeper, scrubber’ might be possible (cf. also Fi. hos-u- ‘to work carelessly, in a rush’). Or if this is, as UEW’s etymology would imply, really assibilated *hocija… a root that looks somewhat comparable to me is Samic–Mordvinic *šodə- ‘to let out, run out’ (maybe first derived to *šodə-j- > *hoci- ‘to throw/sweep things out’). A PU *čočV-, on the other hand, should not give Finnic *h- but *s-, via the affricate dissimilation seen also in e.g. *čečä ‘uncle’ > *ćečä > PF *setä.
[4] Worth noting, besides Aikio’s argument that cognates elsewhere in Samoyedic require a protoform with *ä-ə, is also that *türə would be expected to give Sk. **tir with a short vowel. tīr shows Helimski’s Law = Proto-Selkup vowel lengthening in Proto-Samoyedic *ə-stems, < PU *CVCCə stems and some *CV(C)CA stems (a relatively recent discovery from 2007).
[5] This does still leave Permic *ar ~ (Core) Mansi *ārmə (closed syllable per Pelym årəm with a short vowel), but the latter should clearly be analyzed as a loan from the former; more specifically, from derived *arm as reflected in Udmurt. Permic *a has no well-established native source at all and even some more dubious cases only really point to some possible origin from *ä.
[6] “Us” being myself, Santeri Junttila, Sampsa Holopainen & Juha Kuokkala, plus original data assembly by Kallio.