Nerdsnipe of the day: the BEDLAN team, researching diversification of the Uralic languages interdisciplinarily, mentioned earlier today that they will soon be uploading version 3 of their UraLex dataset of basic vocabulary across Uralic. I thought this might be a good time to do a look-over of the data, from a not-that-computational historical linguist’s point of view (i.e. mostly on the contents, not the technical details). Maybe these comments will be helpful either to the team or to other people aiming at similar projects.
The selection / definition of languages looks mostly good already to me, with varieties being specified fairly closely, including details like “Sosva Mansi” rather than just “Northern Mansi”. Unmarked “Selkup” is, however, questionable at least. This is claimed in the documentation to be more specifically Taz Northern Selkup, the currently most vital dialect and the basis of current written Selkup. The listed forms, though, often look more like the Proto-Selkup reconstructions from Sölkupisches Wörterbuch, e.g. in retaining PSk *č (> modern NSk /t/) and *uə (> *Cʷë > modern NSk /Cɤ/, /wɤ/). A similar issue is the database’s “Karelian Proper”. This too does not appear to be any real variety of Karelian, but rather the interdialectal lemma forms of Karjalan kielen sanakirja, which are frankly overly Finnishized (not really actual Proto-Karelian), and elide many important contrasts, especially voiced obstruents and, mostly, the s / š contrast. E.g. rasva for ‘fat’ only appears as such in the Oulanka dialect. Most northern Karelian has rašva, much of southern Karelian razva, some intermediate southern dialects ražva.
The KKS and SkWb lemmas are probably tolerable as lexicostatistic indices to Karelian and Selkup, but I hope some future update might fix this in favor of actually-recorded language varieties — and certainly before anyone tries to do phonological analysis with this data!
I would have some desiderata myself on what varieties’ classification would be interesting to gauge by their lexicon. Foremost maybe transitional varieties, such as Karelian Isthmus Finnish; NE Erzya and Shoksha; Pelym, Lozva & Eastern Mansi; Berezovo, Nizyam, Salym & Vartovskoe Khanty; anything really among the Selkup dialects. But it’s possible that this is too fine detail for a Uralic-wide dataset and would call for within-language-group studies instead, similar to Rydving (2013) on Sami. And it appears that the most important additions for within-Uralic study have already been planned: adding Moksha besides the currently represented Erzya; Hill Mari besides Meadow Mari; Obdorsk (Northern) Khanty and Pelym (Western) Mansi varieties besides EKh and NMs; Kamassian and Mator within Samoyedic. These should cover many bases. E.g. the well-known Mansi cognate(s) of Hung. tűz, EKh tö̆ɣət ‘fire’ are not recorded from NMs, but do appear in WMs (Pelym toåwt, Upper Lozva töät, North Vagilsk tüöwt, etc.).
A different point entirely is that attempts to study specifically the interrelationships of the nine basic Uralic branches would, I think, function the best if using their protolanguages as the basic data points. There are also a few gotcha cases where no coverage of modern-day languages is sufficient: occasional native Uralic terms might be reconstructible for Proto-Mansi only from early 19th century wordlists, for Proto-Samoyedic only from Castrén’s mid-19th century records, for Proto-Mordvinic only from Witsen’s 18th century records, for Proto-Hungarian only from early medieval records, etc. Comparative-historical Uralistics is maybe not particularly philology-centered, but has never been able to afford overlooking philology entirely.
The selection of semantic concepts to cover is generally reasonable, pulled from major basic vocabulary lists like various Swadesh lists and the Leipzig-Jakarta list. Some of the items on these do break up completely to noise within Uralic, but that’s a good point to have on record as well. I do not think the classic Swadesh list was assembled very rigorously, and at some point it would be good to know not just something about the relative average stability of concepts on it, but also their variance in stability across different language families. An example I have often mentioned in discussions related to this is how in Uralic, ‘fish’ and ‘moon’ are highly stable, while ‘cow’ is unreconstructible and ‘sun’ is highly unstable; while in Indo-European, ‘cow’ and ‘sun’ are highly stable, vs. ‘fish’ unstable and ‘moon’ just about unreconstructible. (This phenomenon e.g. already constitutes a fairly strong critique of glottochronology or any models resembling it, which would rather predict average variance to be a monotonic function of average stability.) Many of the more unstable and entirely unreconstructible concepts seem to be from the LJ list. This is basically what we should expect I think, since these have been selected only by their stability vs. loaning, not vs. all the other lexical innovation processes out there like derivation, semantic shifts, onomatopoeia, a priori coinages (and also not even vs. the likelihood of synchronic synonymy).
There are regardless still many semantic concepts or etymological groups that I think would have a bunch to say about the diversification of Uralic, but which haven’t made the mark. These are I suspect typically more Uralic-specific, and they could not be easily located by general cross-linguistic considerations. Simple examples include e.g. terms for local fauna (*śixələ ‘hedgehog’, *onča ‘nelma, Stenodus’), flora (*ďëmə ‘bird cherry’, *pečä ‘pine’) and technology (*joŋsə ‘bow’, *ńëlə ‘arrow’). More involved examples tend towards etyma that Helimski (2001) has called core vocabulary as distinct from basic vocabulary: often verb roots, relational terms, or incipiently grammaticalizing body part terms, that may not have strong semantic stability but do have decent etymological stability. In Uralic thus e.g. *kixə- ‘to rut, lek, be excited, lustful, want’, *kulə- ‘to go out, run out, wear, end’; *pučkə ‘hollow, tube, inside, marrow’; *pončə ‘tail, hem, back part’ (glosses not meant as PU but indicating the range of variation in reflexes). Most regular lexicostatistic methods run poorly however if matched against etyma that don’t have stable or well-defined proto-meanings, e.g. we can’t really ask what is “the” replacement of such an item in a language that has lost it. Down the line, some new techniques entirely will be required for making use of this kind of data instead.
Phonetics & Phonology
I do not know what use, if any, is planned for this part of the data, but especially inconsistent IPA transcription seems to remain a major problem, as many other times in Uralic studies.
- v is transcribed as a fricative /v/ rather than the approximant /ʋ/ for Estonian, Votic and Ingrian (though correct in Finnish).
- A phenomenon I’ve seen in many online sources over the last ~10 years, Finnish h is given superfluous and partly incorrect transcription as /ç/, /x/ in many clusters and /ɦ/ in many medial positions. E.g. karhea ‘rough’ as “/karçe̞a/”, though fricative allophones only appear with any systematicity in the syllable coda. Even then these have enough variability that I would think leaving this as phonological /h/ would surely be the safest choice.
- Some Finnish falling diphthongs are transcribed with glides as the 2nd component (aurinko ‘sun’ /ɑwriŋko̞/, koira ‘dog’ /ko̞jrɑ/), others with close vowels (jauhaa ‘crush’ /jɑuɦɑː/, oikea ‘right’ /o̞ike̞ɑ/).
- Estonian length marking is a mess. -p- -t- -k- appear seemingly at random as either /p t k/ (thus also -b- -d- -g-) or /pː tː kː/ (thus also -pp- -tt- -kk-); sometimes even in the same word, e.g. lükata ‘to push’ as “/lykɑtːɑ/” (as if ˣlügatta ?)! I don’t have strong opinions on if it’s more proper to use /pˑ tˑ kˑ/ for transcribing grade 2, or maybe /pːː tːː kːː/ for grade 3, but please at least make the distinction. I’m not even going to start on long/short clusters or overlong vowels, which are maybe less phonologically relevant anyway.
- Estonian palatalization has also gone absent, e.g. lill ‘flower’ as /lilː/ and not /lilʲː/. Also, four slip-ups of õ turning up as IPA /ɣ/ rather than /ɤ/: “/hɣːrutɑ/” ‘rub’, “/kɣvɑ/” ‘loud’ (but correct in /kɤvɑ/ ‘hard’!), “/lɣkːs/” ‘trap’, “/mɣmisetɑ/” ‘mumble’.
- Votic transcription includes some allophones like [d̥ g̊ vʲ ɑˑ], but leaves unmarked maybe the most prominent allophone in the language, л = [ɫ], “dark L”. I did not catch any ˣ/ɣ/ pro /ɤ/ mistakes.
- I’m happy to see that most languages’ palato-alveolar ľ, ń, ś etc. have been transcribed as /ʎ/, /ɲ/, /ɕ/ etc. rather than the incorrect /lʲ/, /nʲ/, /sʲ/ seen in many naive attempts to IPA-fy Finno-Ugric transcription; but this has been overdone to include also Erzya, for which palatalized alveolars are correct. Not a major issue ultimately, but still an inconsistency.
- Meadow Mari ə̑ has been transcribed as /ə̱/, which is a bit superfluous; /ə/ would be sufficient. (It is rather Hill Mari ə (= reduced e) that would call for a diacritic in IPA, probably /ĕ/ or /ə̟/.) — The Ob-Ugric data has had the ə / ə̑ distinction phonologized away entirely, though if desired, it could be maintained phonetically at least in Eastern Khanty.
- Komi and Udmurt: FUT i̮ / literary ‹ы› is given as /ɯ/, rather than the more correct /ɨ/, and e̮ / ‹ӧ› has been rendered as /ɤ/ though probably /ə/ or /ɘ/ would be likewise more consistent (as in the Oxford handbook of Uralic from this spring). Even a / ‹а› might be for the Permic languages better rendered as IPA /a/ (unlike most of Uralic, where a contrasts with /æ/ and is thus indeed better rendered as IPA /ɑ/).
- Hungarian uses tie bars for some of its affricates, /t͡s t͡ʃ/ etc. Not incorrect in any way, but this is used nowhere else in the data and not even entirely consistent within Hungarian. I also notice a straggling flap /ɾ/ appearing in erdő ‘woods’, féreg ‘worm’ that seems like an error.
- Uvulars in Khanty aren’t dealt with very consistently at all. [q ʁ] as back-vocalic allophones of /k ɣ/ go unremarked, but /χ/ is indeed transcribed as uvular (ditto in Mansi). Worse, some data with /χ/ has been incorrectly entered for Vakh-Vasyugan Khanty, e.g. jŏχət- ‘come’, koχ ‘long’ (the actual VVj forms are jŏɣət-, koɣ). Only western Khanty ever has χ!
I suspected data mix-up initially, but this clearly must be a processing problem instead, given even e.g. köχ ‘stone’: no such form appears anywhere in Khanty (it’s VVj köɣ, Jugan kä̆w, other Surgut kä̆ɣʷ, all western kew). Are these words derived from some orthographic source that spells VVj /ɣ/ as Cyrillic ‹х›, by any chance? (But still correct forms in many other cases like oɣ ‘head’, soɣ ‘worm’, wajəɣ ‘bird’.)
Looking over these issues, I could formulate a Rule #1 for IPA-fying FUT: the transcription systems do not correspond 1:1 and several details must be, alas, checked on a language-by-language basis. Especially vital is understanding your source data: whether whatever you are IPA-fying is pre-WW2 “hyperphonetic” FUT; mid-century “major-allophonic” FUT; or post-70s “phonological” FUT. IPA comes with its bracket notation [d͇], /d/, //ð// etc. to warn what level of transcription you might be dealing with… FUT does not, perhaps its biggest flaw. A related Rule #2 might be that it’s similarly important to understand what you are trying to do with IPA: phonological, broad phonetic or narrow phonetic transcription? Most of the time, there is no One Correct IPA Representation either.
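Some of the slips listed above (Estonian /ɣ/ for /ɤ/, Vakh-Vasyugan forms with χ) are the kind of thing a per-language sanity check could catch mechanically. The following is only a minimal sketch in Python: the `UNEXPECTED` table and the `flag_unexpected` helper are my own hypothetical names, not anything in UraLex, and the table encodes just the two observations made in this post, not a full inventory of either language.

```python
# Symbols that, per the observations above, should not turn up in a given
# variety's transcription. Illustrative assumption, not part of UraLex.
UNEXPECTED = {
    "Estonian": {"ɣ"},              # õ should come out as ɤ, never ɣ
    "Vakh-Vasyugan Khanty": {"χ"},  # only western Khanty has χ
}


def flag_unexpected(language, forms):
    """Return (form, offending symbols) pairs for one language."""
    banned = UNEXPECTED.get(language, set())
    flagged = []
    for form in forms:
        # A string iterates over its characters, so this finds banned symbols.
        hits = sorted(banned.intersection(form))
        if hits:
            flagged.append((form, hits))
    return flagged


# Forms quoted in this post, slashes stripped.
estonian = ["hɣːrutɑ", "kɣvɑ", "kɤvɑ", "lɣkːs", "mɣmisetɑ"]
khanty = ["jŏχət", "koχ", "köχ", "oɣ", "soɣ", "wajəɣ"]

for lang, forms in [("Estonian", estonian), ("Vakh-Vasyugan Khanty", khanty)]:
    for form, hits in flag_unexpected(lang, forms):
        print(lang, form, hits)
```

A check like this only flags candidates for a human to look at; it cannot say what the correct form is, and a language with no entry in the table simply passes.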
In the base FUT data I do not see any further major issues. It would be probably good to make sure to distinguish ´ (the suprasegmental palatalization sign) and ˈ (the overlength / strong-grade cluster sign) in the Samic data though. Currently both seem to be much of the time encoded as a simple apostrophe; e.g. Inari Sami kyevˈđi ‘snake’, Skolt ku´vdd ‘id.’ are given as “kyev’di”, “ku’vdd”. Occasionally even opening or closing single quotes appear (thanks, Microsoft). Apostrophes do actually even triple duty in marking palatalized ľ in other languages, but this seems unlikely to do any real harm.
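Finding the affected Samic forms is easy to automate, even if deciding which sign was meant is not. A minimal sketch, assuming the two intended signs are encoded as U+00B4 (´) and U+02C8 (ˈ), which is how they appear in this post but may not match the dataset; `AMBIGUOUS` and `ambiguous_marks` are hypothetical names of my own.

```python
import unicodedata

# Generic quote characters that cannot tell the two Samic signs apart:
# plain apostrophe, opening single quote, closing single quote.
AMBIGUOUS = {"\u0027", "\u2018", "\u2019"}


def ambiguous_marks(form):
    """List (position, code point, Unicode name) for each generic quote."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(form)
        if ch in AMBIGUOUS
    ]


# The first two forms use generic quotes; the last two use the dedicated
# signs (U+02C8 and U+00B4, assumed) and should come back clean.
for form in ["kyev\u0027di", "ku\u2019vdd", "kyev\u02C8\u0111i", "ku\u00B4vdd"]:
    print(form, ambiguous_marks(form))
```

Which of the two signs a flagged apostrophe stands for still has to be decided per language and per word, so this is a to-do list generator rather than a fix.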
The dataset is of course primarily about attested lexical data, so I maybe should not spend too much time on examining the proto-language reconstructions included (only Proto-Uralic, no intermediate reconstructions). Still, this is protouralic dot wordpress I am blogging at, so some observations on that topic too.
The transcription scheme seems to closely follow Janhunen 1981, Sammallahti 1988. The *i/i̮ reconstruction for noninitial syllables is used almost thruout; an *-e- has slipped in only in *koje-mV ‘husband’. *i̮ rather than *e̮ is used in initial syllables too, however still an **a in at least a few lexemes like *maksa ‘liver’, *maɣi̮ ‘earth’ (= J *mi̮kså; S *mɨkså, *mɨxi); also *ńś rather than *ńć, though a traditional *ć is still retained in some cases. Different transcription schemes are more inconsistently mixed for the “voiced spirants”, including ‹δ› in *śaδa- ‘rain’, but ‹ð› in *wuði̮- ‘new’; ‹x› in *juxi̮- ‘to drink’, but ‹ɣ› in *miɣi- ‘to give’.
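Mixed notation of this kind can be listed mechanically before anyone decides which symbol to standardize on. A minimal sketch, treating δ / ð and x / ɣ as pairs of notational variants in line with the reading above; `VARIANT_SETS` and `mixed_notation` are hypothetical names, and the sample forms are just the ones quoted in this post.

```python
# Pairs of symbols treated here as notational variants of one segment.
# This grouping is an assumption for illustration.
VARIANT_SETS = {
    "dental spirant": {"δ", "ð"},
    "velar spirant": {"x", "ɣ"},
}


def mixed_notation(forms):
    """For each variant set where more than one symbol is in use,
    map each symbol to the forms that contain it."""
    report = {}
    for label, symbols in VARIANT_SETS.items():
        usage = {s: [f for f in forms if s in f] for s in sorted(symbols)}
        usage = {s: fs for s, fs in usage.items() if fs}
        if len(usage) > 1:
            report[label] = usage
    return report


forms = ["śaδa-", "wuði̮-", "juxi̮-", "miɣi-", "maɣi̮"]
for label, usage in mixed_notation(forms).items():
    print(label, usage)
```

If the two symbols in a pair are in fact meant to contrast in the chosen reconstruction system, the right fix is to drop that pair from the table, not to merge the symbols.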
A possible consequence of the dataset’s original compilation for a lexicostatistic review of the traditional Uralic classification is also that some meanings are marked as “[Not reconstructible]”, although they would have well-established though western-leaning proto-forms, e.g. *külmä ‘cold’ (maybe debatable; an IMO poor loan etymology from Balt(o-Slav)ic remains marked for the reflexes), *mälə ‘mind’ (clearly PU; this is reflected in derived verbs in Ob-Ugric), *läwlə ‘heavy’ (EKh ‘cold’ probably doesn’t belong). Some items reconstructed in recent literature are missing too, e.g. Aikio’s revamped *këččə ‘bitter’, *widä- ‘to kill’. More worrying for me is how also many long-known proto-forms are left absent, such as *küsə ‘thick’, *näkə- ‘to see’ (admittedly most reflexes derivatives w/o this meaning), *lükkä- and *puskə- both ‘to push’, *śepä ‘neck’, *sańća- ‘to stand’, *wëlkə ‘white’. I don’t think this can be just due to later semantic divergence in some reflexes, when e.g. *jelä ‘day’ has been admitted as a PU form only from Samoyedic direct evidence (parallels also at minimum in Samic); and *śilä ‘fat’ from no direct evidence at all? Yet also some poor comparisons from UEW seem to remain around, e.g. “*čočV-” ‘to wipe’; actually its only reflex meaning ‘wipe’ is Finnish huosi-, which I don’t think can belong here. These types of issues may even combine for more involved cases. E.g. the PU word for ‘full’ is given as *türə, a narrowly distributed Finnic–Permic etymon, and not the better-distributed *täwdə. This is again probably per UEW, which maintains Selkup tīr as reflecting the former and not, as recognized since Aikio 2002, the latter. Or, the word for ‘year’ is given as *ärV; but this reconstruction was in effect already refuted by Aikio 2012, who points out that the Samoyedic forms (meaning ‘fall’) go back to PSmy back-vocalic *ër-, which continues rather the already better-distributed PU form *ëdə.
A methodological choice also seems to have been that no synonyms are admitted for PU, although there probably are a few concepts in the data for which they existed; e.g. besides *śilä for ‘fat’, we can reconstruct also *wajə, *koja (both already alluded to in the database; the former though specializing to ‘butter’ in most Uralic languages familiar with agriculture).
(All my Uralonet links above show what I think of as their most reliable reconstructions, but defending those would be at times quite a debate that I don’t intend to get into in detail here — I’ll be happy as long as the reconstruction system chosen is at least internally consistent enough.)
Since following newer literature adequately appears to have given some difficulty for the team, I would like to note here (I think for the first time on this blog) that I’ve already a few years ago started a little repository of new results in Uralic etymology, currently keeping track of
- newly proposed PU reconstructions;
- newly found reflexes of known reconstructions;
- newly found loan etymologies for what have previously been thought of as native Uralic etyma.
The list(s) can be found at the Sanat wiki, as a part of / appendix to our etymological database of Proto-Finnic. 
Currently pending updates include, besides better coverage of several earlier but post-UEW sources, especially several new native and loan etymologies for Mari and Permic from Metsäranta’s PhD thesis from 2020. I have also been thinking of starting an “antietymological” sister repository, tracking PU reconstructions that have been clearly disproven by better etymologies being published for all or all-but-one of their reflexes, of which there are quite a few by now too.
The cognate and loan etymologies are maybe the core content of the dataset. Standard literature has been followed quite faithfully here and I see no major flaws (even where etymological relationships have not been seen fit to be promoted to Proto-Uralic status). Mostly I can just point out some recent and overlooked results. Besides cases already mentioned:
- The Hungarian word for ‘claw, nail’ has unfortunately been given as the less basic karom ‘claw, talon’ rather than köröm ‘claw, nail’, which Aikio has even recently argued to be indeed a reflex of PU *künčə.
- The Samoyedic words for ‘to scratch’ derive from PSmy *kətå ‘nail’, much as also e.g. Khanty *kö̆nč- does double duty as ‘nail; scratch’. The base noun is, probably correctly, not admitted as a cognate of rest-Uralic *künčə. The verb entry however inconsistently does encode them as cognates.
- The most notable loan etymology missing entirely is probably the derivation of Erzya veśe and Hungarian össze- ‘all’ from earlier *wiśwV- ← Proto-Indo-Iranian *wićwa- > Sanskrit víśva, etc. (an etymology due to Katz 2003 that was unfortunately overlooked by Holopainen 2019). Both are regular: for Hungarian *wi- > *wü- > *ü- also in IIr. loans (besides native ones like *widä > öl- ‘to kill’), cf. the already long-known özvegy ‘widow’ < *wiðVwädźV ← Scythian / pre-Alanian *widawa-čī.
- There is probably room to adjust many of the individual loanword etymologies, e.g. Kildin Sami sūll´ ‘salt’ is not borrowed from Russian соль but, as maybe the palatalization best reveals, from Finnic *soola (thus also UEW, SSA). This would regularly continue a Proto-Samic *sōlē > Peninsular Eastern Sami *suəllʲe, also present in Skolt suõ´ll. Would be way too much work for me to start digging into these on my own though with any consistency.
- There are, on the other hand, still several Proto-Indo-European loanword etymologies advanced that do not seem very reliable (were they ever widely accepted?), e.g. *pelə- ‘to fear’ ~ PIE *pelh₁- ‘to shake’ (which only gives ‘fearful’ in derivatives in Gothic and Slavic); *śalə ‘gut’ ~ PIE ? *ḱolH- ‘turn’ (which only gives ‘gut’ in Greek). These are though only marked as “probable”, not “clear”. Is this basically a euphemism for “not that likely”?
I suppose this is by now enough comments for one day. I know that assembling and curating datasets this big is quite the task, and I could probably also spend a week more reading this in further detail. Hopefully I’ve already pointed out some productive directions for future improvement though. (And if you were thinking of otherwise releasing 3.0 just tomorrow: sure, don’t mind me, there will be time in the future too to improve things.)
Edit 2022-06-27: See also some brief responses from Outi Vesakoski (and further from me) on Twitter!
On Taz Selkup as the “most vital” dialect: very relatively so, at triple rather than double or single digits of speakers.
On philology: so far the biggest gap in philological coverage is probably the old Swedish “Biblical Sami” records, substantial already in the 18th century, which to my knowledge have never been looked over in detail etymologically.
On Finnish huosi-: this has been further etymologized as being maybe from Proto-Finnic *hosja ~ *hoosja ‘horsetail, Equisetum’ (traditionally used to make scrubs), which I don’t think has itself any etymology yet. By its phonological structure it obviously cannot be native Uralic as is. Inverting the semantic derivation though, an irregular (?) contraction from an agent noun *hosija < *hose/i-ja ‘sweeper, scrubber’ might be possible (cf. also Fi. hos-u- ‘to work carelessly, in a rush’). Or if this is, as UEW’s etymology would imply, really assibilated *hocija… a root that looks somewhat comparable to me is Samic–Mordvinic *šodə- ‘to let out, run out’ (maybe first derived to *šodə-j- > *hoci- ‘to throw/sweep things out’). A PU *čočV-, on the other hand, should not give Finnic *h- but *s-, via the affricate dissimilation seen also in e.g. *čečä ‘uncle’ > *ćečä > PF *setä.
On Selkup tīr: worth noting, besides Aikio’s argument that cognates elsewhere in Samoyedic require a protoform with *ä-ə, is also that *türə would be expected to give Sk. **tir with a short vowel. tīr shows Helimski’s Law = Proto-Selkup vowel lengthening in Proto-Samoyedic *ə-stems, < PU *CVCCə stems and some *CV(C)CA stems (a relatively recent discovery from 2007).
On *ärV ‘year’: this does still leave Permic *ar ~ (Core) Mansi *ārmə (closed syllable per Pelym årəm with a short vowel), but the latter should clearly be analyzed as a loan from the former; more specifically, from derived *arm as reflected in Udmurt. Permic *a has no well-established native source at all and even some more dubious cases only really point to some possible origin from *ä.
On “our” etymological database: “us” being myself, Santeri Junttila, Sampsa Holopainen & Juha Kuokkala, plus original data assembly by Kallio.