Back in 2009, a very interesting paper was put out by Jaakko Häkkinen, then an early-stage PhD student: “Kantauralin ajoitus ja paikannus: perustelut puntarissa”. While no longer especially up to date (I will probably follow up on this claim in another post soon-ish, once one major paper in the works has come out in a future issue of Diachronica), this still remains a notable work that has turned out to be an impetus for quite a lot of discussion over the 10s and onward, on our basic assumptions about the early history of the Uralic languages. One of Häkkinen’s suggestions is to attribute some of the shared Finnic–Mordvinic vocabulary to a common southwestern substrate language. He outlines this on the basis of just six words that can be suspected to be of substratal origin per their semantics: three deciduous trees with a southern distribution (the word families of Finnish tammi ‘oak’, vaahtera ‘maple’, pähkinä ‘nut’ < *‘hazelnut’), two species of high importance to agricultural societies (Fi. vehnä ‘wheat’, lehmä ‘cow’), and one innovative numeral (Fi. kymmen(en) ‘10’). All of these also show novel phonotactic features: the word-medial consonant clusters *-mm-, *-kšt-, *-šk-, *-šn-, *-šm-, per him not attested in the Uralic comparative data reaching into the Ugric or Samoyedic languages. Häkkinen mentions also some more narrowly distributed substrate loan candidates with similar phonotactic features (e.g. with geminate nasals: Fi. konna ‘toad’, nummi ‘heath’; Northern Sami lidnu ‘eagle owl’, dápmot ‘trout’) that had been identified already in still earlier studies probing the possibly substratal vocabulary of Finnic or Samic in particular. But as far as I can tell, the idea of a common substrate vocabulary layer extending also further east to Mordvinic, partly even Mari and Permic, was a new key innovation.
Increasing phonotactic complexity towards the (south)western end of Uralic is quite apparent really as soon as you pay attention to the topic. Already in one of my earliest posts on Freelance Reconstruction in 2013 I outlined the branch-level distribution of the clusters *šk, *kš, *kšk and *kšt across the Uralic comparative material. Heavy emphasis on Finnic, Mordvinic and Mari, but notably not on the northwestern Samic, is immediately evident. So there probably should be quite a lot of material that might be attributable to this “Agricultural Substrate” if we went looking for it in detail. 2014-ish I started collecting some additional data on this, taking particular semantic fields as my starting point. Before this reached sufficient completion though, a few other publications already ended up paying more attention to the same vocabulary stratum. I first saw Ante Aikio’s take, in a preprint version of his article “The Finnic ‘secondary e-stems’ and Proto-Uralic vocalism”. This singles out the consonant *š already by itself as a marker of vocabulary of possibly substratal origin (with 25 examples given; about 10 of them not otherwise phonotactically suspect) as well as proposes 9 other cases more on the basis of general phonological irregularity. As he had worked already earlier extensively on the Samic substrate in Northern Finnic and the pre-Uralic substrate of Samic, perhaps some of this was discovered independently though… Aikio only refers to Häkkinen’s paper passingly, not as a main inspiration.
Before Aikio’s paper officially came out in 2016, another version still was also outlined by Mikhail Zhivlov in a small conference paper “Неиндоевропейский субстрат в финно-волжских языках”, which identifies 20 items, likewise on the grounds of phonotactic novelties, the general presence of *š and some phonological irregularities; with substantial overlap with Aikio’s list. Taken together, these were already about as much as I had assembled too, and I haven’t done much more on my draft since. Not much else seems to have happened on this topic in the late 10s either.
Last fall however, Carlos Quiles, an archeology/genetics/linguistics blogger at Indo-European.eu now seems to have put together a somewhat more substantial review of this and also some other data relevant for Uralic linguistic archeology, in a series of about ten blog posts starting here. This is nominally aimed more at locating the Proto-Uralic homeland — though it is easy to notice that Quiles relies mostly on secondary sources so far, and seems to miss a decent amount of relevant basic data in his chapters working more towards this goal. E.g. already the section on fishing technology is missing at least *sopśə ‘net needle’ and *tulkV ‘dragnet’; perhaps because these are traditionally identified as “Proto-Finno-Ugric” (only found up to Khanty in the east) and thus absent from earlier sources attempting to apply linguistic archeology to Proto-Uralic specifically. I also wonder about some geographic claims like Udmurt supposedly being spoken within the range of the Siberian pine. Probably today if we count migrant dialects further east and/or planted Siberian pines, but to my knowledge it’s certainly not native to Udmurtia (not even most of Komi Republic).
A full review of this whole topic would be a more involved question than I want to go into on the blog though, and anyway I am also not highly impressed by the overall precision of linguistic archeology as a method. It works just fine for ruling out places like the Circum-Baltic, the Arctic coast or the Caucasus as the Proto-Uralic homeland, but finer details like the long-standing debate on Volga-Kama versus Western Siberian homelands don’t seem like they can be easily resolved. At least two reasons conspire to make further progress difficult. One, if a language family starts off as (a part of) an only slowly expanding or even in situ diversifying dialect continuum, we might have trouble distinguishing “common Family” vocabulary from true proto-Family vocabulary. If any newly incoming vocabulary avoids hitting all the earliest isoglosses within the family, or is etymologically nativized across them, it may end up gaining a wide distribution and an appearance indistinguishable from native. Cases like the common Algonquian calque ‘firewater’ for ‘whisky’ that can be identified as much too recent on cultural grounds are just the tip of the iceberg here. Others could include cases like Proto-Finnic *lohi ~ Proto-Samic *lōsë ‘salmon’, which happen to fall into the outlines of Uralic comparative phonology just fine and would point to a common proto-form *lošə. Both are probably instead more recent loans from Baltic, either independently or in Samic thru Finnic; this would be so even if they did really go back to this form in both lineages. From some language pairs like North Estonian ~ South Estonian (last common ancestor ca. 500 BCE), or indeed dialect pairs like Western Finnish ~ Eastern Finnish (LCA ca. 500 CE), with heavily parallel and mutually reinforcing trajectories of historical development up to today, we could probably find examples of this type by the thousands.
(I call this phenomenon “convergent parallel loaning” and hope to one day treat it in more detail than just the one presentation in Finnish from 2016 so far. Cf. also Häkkinen’s spin on this under the name “invisible convergence”.)
I also consider it probable that our efforts on Uralic reconstruction so far on many points stop at the common Uralic stage, maybe especially in vocalism, not quite yet reaching Proto-Uralic proper. This is evident when attempting to reconstruct the proto-forms of several core vocabulary items, e.g. ‘heart’. West Uralic (Samic, Finnic, Mordvinic) suggests *ćüdäm(ə); Udmurt /śulem/ suggests *śedämV; Komi /śëlëm/ suggests *śädämV; Ugric suggests *śiďVmV or even *śijVmV; Samoyedic *säjä suggests *śäďä or *śäjä. We have no especially good way to explain most of this kind of “proto-variation” or to decide which, if any, of these variants might be the most original (of course at least the vowel difference between Udmurt and Komi is likely to be recent). The suggestion first made by Zhivlov that traditional PU *ś comes from an earlier *ć that was preserved in Samic, but replaced in areal vocabulary by a new *ć in Permic and the three Ugric branches, is probably right at least though. “*ś” is then basically a Common Nonwestern Uralic (maybe even just Nonsamic Uralic?) but not the proper Proto-Uralic reconstruction. (On structural grounds the same proposal has been made earlier also by at least Janhunen and Abondolo.)
Two, linguistic archeology cannot even in principle pinpoint an origin outside of a family’s current or historical range. Under the basic assumptions behind linguistic archeology, any terminology for e.g. natural realia exclusive to an “external homeland” would have to be either lost or repurposed in all descendants. This would even hold if one of the daughter lineages ended up re-entering the original territory. (Northern Sami speakers moving to Helsinki are not going to magically recover the lost but presumably once extant Proto-Samic words for things like ‘maple’ or ‘eel’.) Suppose for the sake of the argument that Uralic first expanded in a northward fan from someplace around the southern end of the Urals, near Orenburg or Magnitogorsk; southeast of the current range of Permic and Mari, well south(west) of the current range of Mansi. What kind of vocabulary evidence would we even expect this to leave, as distinct from an already originally more northern homeland?
But I believe that’s enough said for now on attempts to locate Proto-Uralic (again, watch for the upcoming issues of Diachronica for news on this). Going back to the Agricultural Substrate, Quiles identifies four semantic areas which would show prominent influence from this:
- tree names and related botanic terms;
In terminology related to animal husbandry and textileworking he gets together a few possible examples too, but contrasted with a more substantial number of loanwords from Indo-European.
I agree with most of these assessments as well. The one exception is apiculture, as the words actually comprising this layer (*mekšə ‘bee’, *metə ‘honey’, *śišta ‘wax’; unreconstructible #käras ‘honeycomb’) all have good Indo-European / pre-Indo-Iranian etymologies, unlike the vast majority of the others, and the cases of *š appearing in these can be well derived by RUKI. Even if *š might often be a marker of the Agricultural Substrate, this does not imply that all cases have to be so, and in particular this does not provide reason to abandon well-established loanword etymologies coming from actually attested language families. By a similar argument, I am likewise unconvinced by attempts to reinterpret words like *šiŋərə ‘mouse’ (with regular reflexes in all three of Hungarian, Mansi and Khanty) as having anything to do with the Agricultural Substrate. The key motivation for setting this hypothesis up in the first place has after all been the highly limited distribution of words of certain semantic categories or with certain phonetic features. If we start including occasional etymologies that reach also Ugric or Samoyedic, we can no longer maintain the original explanation for why other words of this layer do not do the same (i.e. that the Agricultural Substrate was never in contact with these branches of Uralic). This indeed would come close to abandoning any reason for treating this layer as non-native in Uralic in the first place!
An additional issue that I seem to notice at this point is that, out of the possibly substratal cases of *š, quite a few also occur in RUKI environments. The cluster *kš is particularly prominent: *makša ~ *mäkšä ‘rotten wood’, *päkšnä ‘linden’, *wakštVra ‘maple’, maybe *päkškV ‘hazelnut’ and *tekškä ‘ear of corn’ (surfacing as *šk ~ *kš vacillation). There is also a phonologically similar though clearly non-IE *š after *ŋ in *jaŋša- ‘to grind’, maybe also behind *riŋəšə ‘threshing ground’. Examples of *ks or *ŋs also do not seem to occur. I suspect that this points to the Agricultural Substrate actually coming to Uralic second-hand, and that it was instead first adopted into an extinct para-Balto-Slavic and/or para-Indo-Iranian language that, as expected per general Indo-European dialectology, regularly retracted *s to *š at least after velars; including in words that it had earlier adopted from the Agricultural Substrate proper. This hypothesis gives us also some more wiggle room in identifying the substrate in the archeological record: even archeological cultures that were probably Indo-European-speaking could be considered as the source.
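The observation about *š after velars is easy to check mechanically for the forms cited above. The following is only a minimal sketch: the names `FORMS`, `VELARS` and `sh_only_after_velars` are made up for illustration, "velar" is assumed to mean just *k and *ŋ, and the word list is limited to the five forms given here without a question mark, so it says nothing about the full data set.

```python
import unicodedata

# Forms cited in the paragraph above, without the asterisk or final hyphen.
FORMS = ["makša", "mäkšä", "päkšnä", "wakštVra", "jaŋša"]

# Assumption: only *k and *ŋ are treated as velars here.
VELARS = {"k", "ŋ"}

# Normalize so that precomposed and decomposed spellings of š compare equal.
SH = unicodedata.normalize("NFC", "š")


def sh_only_after_velars(form: str) -> bool:
    """Return True if the form contains š and every š directly follows a velar."""
    form = unicodedata.normalize("NFC", form)
    positions = [i for i, ch in enumerate(form) if ch == SH]
    if not positions:
        return False
    return all(i > 0 and form[i - 1] in VELARS for i in positions)


results = {form: sh_only_after_velars(form) for form in FORMS}
for form, ok in results.items():
    print(f"*{form}: {'š only after a velar' if ok else 'š elsewhere or absent'}")
```

A form like *lošə, with *š after a vowel, fails the same check, which is the kind of contrast the second-hand borrowing hypothesis would need to be tested against on a larger word list.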
Speaking of the ultimate identity of the substrate, Quiles has an interesting new suggestion on this, too: he seems to have found parallels for a number of the involved words in the West Caucasian language family, and attempts to sketch ways it could have been in contact with Uralic. This I think would be worth further exploring. Some more data to this effect might be also findable from Bernát Munkácsi’s 1901 monograph Árja és kaukázusi elemek a finn-magyar nyelvekben. While Uralic–Indo-European loanword studies have long been an extensive and productive field, on the topic of Uralic–Caucasian comparison of almost any flavor this remains just about the most recent even halfway serious overview. Directionality, however, is not obvious to me. As Quiles notes, the WC ~ Uralic parallels center on technology and metalworking terminology. It seems to me they could be well explainable, besides pure accidental resemblance, also as a set of recent Wanderwörter, or parallel loanwords from a lost common source. There is thus barely any evidence yet to speak of a West Caucasian substrate language specifically.
By now I would have also more detailed comments on numerous individual etymologies proposed to belong in the Agricultural Substrate by one researcher or the other. This task will be best left for another time however, in many cases maybe also for another context entirely, and I might return to the topic only after having gotten more of these forthcoming etymological etc. observations out to print individually. Substrate languages are a fascinating topic, but they really are not highly feasible to tackle head-on: they emerge only from the dark corners of linguistic reconstructions, generally identifiable more by what is absent than by what is present.
 While Häkkinen continues to be active in our field and has a lot to say especially on the topic of the relative and absolute chronology of Uralic languages (recently e.g. coauthoring an article on Southern Sami with Minerva Piha in the latest Sananjalka), his PhD though unfortunately still remains unfinished.
 Part of the Finnish / Swedish grouping jalopuut, jalot lehtipuut / ädellövträd ‘noble (broadleaf) trees’. Other generally agreed members include the elm, ash, linden, beech and hornbeam. This might be convenient to calque into English too. Delimiting it in a context wider than just the Nordics has some difficulties though… would we only accept species whose distribution overlaps with the taiga zone at least within gardens, ruling out the likes of plane trees; and would we follow the main practical motivation of the term and rule out softwood broadleaf trees like the poplar?
 Nominally, though, it claims to be in the 2015 issue of Suomalais-Ugrilaisen Seuran Aikakauskirja. I wonder how often these kinds of delays, between when a periodical is dated and when it actually comes out, are due to printing queues and how often due to actual editing issues.
 Mordvinic *käŕas, Mari *käräš, Udmurt /karas/; none of these can be native as such. The Mordvinic and Udm. words show a ⁽*⁾front vowel in the first syllable plus a ⁽*⁾back vowel in the second (PU unstressed *-ä- > Udm. /e/), and such disharmonic vowel combinations always result from either recent derivation or recent borrowing. The Proto-Mari vowel *ä then is non-native entirely. Probably mostly likewise for those cases of pre-Permic *ä that end up retracted to /a/.
‘honeycomb’ in Proto-Mari is *käräs, not *käräš, as evident from the entries in Beke and Veršinin’s dictionaries.
Right (ditto from Moisio–Saarinen); I need to kick this bad habit of retaining modern shibilants when choosing to condense Mari data as proto-Mari…
Wikipedia backing you up.
…In rare cases, maybe it can if the descendants move off in different directions and repurpose the same words in different ways. But of course most of the time that will just leave us with uncertain reconstructions like “some kind of tree” unless a particularly good argument about the directions of the semantic shifts can be made.
*riŋəšə reminds me: I’ve occasionally encountered the claim (probably only in 2nd- or 3rd-hand sources) that Proto-Uralic didn’t allow word-initial *r. More recently I’ve seen PU reconstructions with *r- going uncommented in the primary literature. Is it a suspected loanword marker, or was the original claim just wrong?
While I’m at it: Häkkinen’s “After the Protolanguage” paper lists a split of PU *ë into *ë and *ï, with currently unknown conditioning, as an innovation of East Uralic. Is there a reason not to assume the opposite, i.e. PU *ë and *ï, with merger of the latter into the former an innovation of West-Central Uralic…?
I don’t think even rare cases can really pinpoint a homeland, since a good level of detail requires interpolation between multiple species etc. Say, the modern Sami languages have a word for ‘large solitary tree, esp. pine’ that goes back to *aikkë and (given Germanic *aiks and the fact that large solitary pines will often gain a somewhat oak-like shape) probably actually originally meant ‘oak’. This already seems to suggest that Proto-Samic or some phase of pre-Samic at least was spoken somewhere further south where people knew of oaks, but won’t tell us if it was in e.g. Finland Proper or the Karelian Isthmus or Svealand. What are the odds we would find another similar term that denotes a species native to only one of these areas (most such species will be also absent from today’s Sápmi)?
A “common Uralic” *r- is probably indeed still a loanword marker. Examples start turning up already in Khanty though, but it remains entirely absent from Proto-Samoyedic and even from recorded Mator and old Nganasan. UEW posits some cases suggesting *r- > PSmy *l-, but all can be taken as loanwords from Khanty into Nenets or Selkup.
We seem to have good candidates as the conditioning factors for the *ë/*ï split (in my linked presentation I suggest that it’s primarily per the stem vocalism: *ë-a > *ï); and of course in Mansi this never seems to have gone this far in the first place, since both *ë and “*ï” are reflected as what I’ve argued is better considered a mid vowel *ëë. In principle this still could have come from something like an earlier more subtle split as *ë / [ɤ] ~ *ɛ̈ / [ᴧ] though. (There is some evidence of a separate short close *ï of obscure origin in Proto-Mansi, too, but this is probably too new to be relevant… there are no more than 2–3 cases that have any cognates even in Khanty.)
Regarding the alleged *ë – *ï split, I’m quite sure the idea is erroneous. There seems to be no valid evidence for such an innovation being shared by any two eastern branches – not even by Khanty and Samoyed. Häkkinen’s claim was based on just a couple of etymologies, and actually there are also counterexamples. When one looks at all relevant data, there just seems to be no pattern: in fact, Samoyed *ï can also correspond to Khanty *aa, and Samoyed *ë to Khanty *ïï, which contradicts Häkkinen’s idea. Note, for example, the following cases:
Khanty *kïïčǝl- ‘get moldy’, *kïïčïïm ‘mold’, Samoyed *këčǝ (Selkup *qëëčǝ ‘bad smell, stench’)
Khanty *aańǝj ‘not shy, allowing one to come into shooting range (of birds)’, Samoyed *ïńǝ ‘tame, calm’
In addition, Khanty *ïï is also the regular high grade alternant of original *aa in the Khanty system of Ablaut. So at least in principle, in some word-stems even an unalternating *ïï could have just developed from earlier *aa, presumably under the influence of an Ablaut trigger vowel that was a part of the stem (but later lost). Thus, Khanty *ïï can also reflect PU *a and correspond to Samoyed *a or *å (as in Khanty *kïïj-, Samoyed *kåjå- ‘leave’; Khanty *ïïŋk-, Samoyed *(ń)aŋǝ-t- ‘take off’). This, too, suggests that the issue of Khanty *ïï is not connected with the opposition between *ï and *ë in Samoyed. While Ablaut processes can hardly explain all instances of unalternating *ïï in Khanty, they seem to explain some of them at least, and this creates additional difficulties for the interpretation of the Samoyed-Khanty vowel correspondences.
Major problems are also encountered in other “East Uralic” innovations suggested by Häkkinen; I seem to recall that Mikhail Zhivlov has a very interesting presentation handout on Academia.edu about this. At least so far, the alleged “East Uralic” subgroup seems to have no good supporting evidence in the form of sound changes, and I think the idea should be rejected (until new and better evidence is perhaps produced).
Yes. He takes Häkkinen’s phonological innovations of East Uralic apart one by one, leaving nothing.
A side note regarding alleged *riŋiši ‘threshing ground, grain drying kiln’: there is, in my opinion, a quite plausible alternative etymology for Finnic *rīhi : *rīhe-. It could simply have been borrowed from Germanic *rīxan- > Norwegian rjå, Swedish rie ‘a pole on which grain is placed to dry’, Old Swedish rīa ‘grain drying kiln’. Phonologically, the etymology is neatly paralleled by another loan of the shape *-V̄he- from Germanic *-V̄xV-, namely Finnic *rūhi : *rūhe- ’dugout boat’ < Germanic *þrūxō-. "Lexikon der älteren germanischen Lehnwörter in den ostseefinnischen Sprachen" vol. III accepts the latter etymology but rejects the former, but there seems to be nothing wrong with the etymology per se. It has also been thought that the Nordic words were borrowed from Finnic, but this seems unlikely; instead, they seem to be connected with a more widespread Germanic word-set with meanings like ‘pole’, ‘slat’, ‘line’, ‘row’, etc. (Kroonen, Etymological Dictionary of Proto-Germanic, p. 412). Thus, it is possible that the Permic forms compared to Finnic *rīhi are of another origin after all; and as far as I can see, nothing in the Permic data really necessitates the reconstruction of PU *ŋ in the first place.
An e-stem with *x → h regardless seems anachronistic to me. This is directly contradicted by other loans such as *xardⁱja- → kärsiä, *xaixiz → kaihi and maybe *xabraz → kapris, which are already i-stems despite still showing k- for *x-. I would actually edit slightly the Germanic etymology of ruuhi: a better starting point could be the more widely distributed *trugaz (> En. trough, Ge. Trog etc.), giving early Finnic *ruxəš or *ruxəs : *ruxəhə- > *ruuhë- (probably with an analogical nominative, instead of expected **ruus or **ruuh, as in mies from PG *mēgaz). But then something similar could perhaps also work for riihi though.
Probably some other details of this still need work too: for one, why do we have almost no evidence in Germanic for the specialized sense ‘cabin for drying grain’ but also no evidence for a more generic sense ‘pole, row of poles (for drying grain)’ in Finnic? It does look to me like at least Sw. ria would be better treated as a (back?)loan from Finnic (much as also e.g. Baltic German Riege), given that the riihi proper has never been common in Sweden.
I don’t quite see the contradiction here. Cases of word-initial *k- for Germanic *x- cannot be directly compared to the treatment of word-internal or intervocalic *x. If the Germanic etymology of kaihi is correct, it even demonstrates that *h could be substituted for intervocalic *x, although at the same time *k was substituted for word-initial *x. And even in the case of kaihi there is indirect evidence for an underlying e-stem, namely the verb kaih-ta- derived from the consonant stem. And moreover, the mere emergence of *i-stems (like kärsi- < *kärti- < *kärti(-)j-) does not imply that e-stems would have immediately lost their productivity in the language, and could no longer occur in new loanwords. I think it is easier just to accept the proposed Germanic etymology of ruuhi at face value instead of venturing a speculative reconstruction *ruxiš, or the like. After all, it seems rather evident that stems of the shape *CVxi- have lost their productivity earlier than e-stems during the development of Finnic; and, besides, Pre-Proto-Finnic *u seems to have been lowered into *o before *x (cf. *sou-ta- 'row', *nou-ta- 'pursue', *joo- 'drink').
Valid point about initial vs. medial *x. Lowering by PU *x doesn’t seem relevant to me though: this could well just date as early enough to not affect the issue (it certainly does not affect stems with *ŋ, which I think also develop thru MPF “*x” or rather *ɣ on their way towards vocalization). Also the only real case looks to me to be *souta-, which suggests that this might be limited to coda position. Is there any reason to assume a pre-Finnic *u in *nouta- or *joo- other than the Samic cognates? All other cognates point to a more open original vowel. (BTW the doublet kii-ma ~ kei-ma could in principle show that similarly *i > *e before coda *x, if going back to *kixə-ma ~ *kix-ma.)
Another issue is, why would we have an e-stem in the first place from a Germanic *ō-stem? Karsten’s original proposal (ANV 22, 176–178) has some circularity in that he assumes an earlier PG *i-stem *þrūxiz but based only on Finnic.
Back on the Permic side, the Permyak form /ri̮mi̮š/ from Wiedemann at least would seem to point towards an earlier *ŋ.
I just want to leave a thank you message for your blog! Coming from a computer science background, I find historical linguistics satisfying in much the same way as reverse engineering. My knowledge is limited but I enjoy reading about the processes. I’m a native Czech speaker and a casual Estonian learner, and looking up etymologies always helps me relate and remember vocabulary.
Sometimes I wish I would have a place to bounce off some ideas I have. For instance, particularly related to agricultural vocabulary, is there a relationship between Proto-Uralic *mura ‘cloudberry, blackberry’ and PIE *mórom ‘mulberry, blackberry’? Semantically clear, but I wouldn’t know where to start looking at sound laws to determine it, and if I don’t find information at Wiktionary, Eesti etümoloogiasõnaraamat, or the Czech etymological dictionary I have, it’s as if it doesn’t exist.
I wouldn’t call ‘blackberry’ agricultural vocabulary. You don’t need to know how to farm to get blackberries. And if you are talking about the relationship between Uralic and Indo-European, there is a hypothesis called Indo-Uralic that both are related. I totally think that they do, but not everyone agrees. Unfortunately there is not much useful material out there. I may publish some things in the future, though. As for the correspondence PU *mura ~ IE *mórom ‘blackberry’, it is similar to PU *pura ‘to drill’ ~ IE *bʰor(H) ‘to pierce’ e.g. in Latin forō “I drill”.
You’re right, of course, just because it’s a botanic term doesn’t mean it’s agricultural. In fact some sort of words for berries would be among the first to evolve in a language.
I know there are theories as to some form of Indo-Uralic relationship, and I’m also familiar with the idea of early Proto-Indo-European languages serving as substrate languages causing some elementary loans into Proto-Uralic (as I understand those it seems like loans are more numerous into the latter). I guess it’s so far in the past we can only find such a limited number of examples that determining sound laws with any certainty would be impossible.
I appreciate your input with *pura. I also found that Uralonet offers some possible Turkic cognates to this word. I should sharpen up on my German so I can use Uralonet more often. I might also note this relationship on Wiktionary, since it’s currently the most accessible dictionary of etymologies and reconstructions that I know of and is available to me.
It’s indeed a classic Indo-Uralic comparison, an interesting one in that it can’t be readily taken as a loanword in either direction. Only means ‘cloudberry’ in Uralic to my knowledge though. (And I often find myself thinking “and isn’t it derived from *mor- ‘black’ in IE” before remembering that that’s a Common Elvish root…)
Welcome to the comments crew in any case! I’ve sometimes thought about opening an Open Thread for people like you to send in general comments or questions. Already doable on the tumblr side, but I think many readers on here don’t check there often.
@Howl: *pura is a bit of a different case I think. I’m not sure this can be entirely separated from *porə- ‘to bite (in)’, often used of tools (e.g. Fi. puru ‘biting; sawdust’), where we see early *o > *u in many reflexes via what I call “Janhunen’s Law”, *CoCə > *CuCə. That leaves room for the option that *porə- is a loan from *bʰorH- but the nouns meaning ‘drill’ are independent nominalizations.
>Only means ‘cloudberry’ in Uralic to my knowledge though.
To my knowledge Czech has two words that derive from PIE *mórom: moruše ‘mulberry’ and the diminutive moruška ‘cloudberry’. Hence my entire interest in the word, as I saw a potential connection with Estonian murakas ‘cloudberry, blackberry’. The dictionary term for ‘cloudberry’ is actually rabamurakas, but Estonians commonly use just murakas. (murakas also reminds me of Cz. mrak ‘cloud’, but I’ve successfully ruled that connection out myself since PSl *mȏrkъ means ‘darkness’.)
Thank you for the welcome and apologies for derailing the comments a bit! I suppose I will start collecting my questions and notes for a future open thread :)
I’m nowhere near fluent in Estonian myself, but in my understanding murakas = any Rubus species is a modern technical / botanic usage; in popular usage it is still primarily the cloudberry though also used as a part of the name of the dark red mesimurakas (R. arcticus), while ‘blackberry’ specifically is pampel.
Is “Janhunen’s Law” still considered valid? I always thought it was abandoned together with the idea of a first split into Finno-Ugric and Samoyedic. Cases like PU *woli ‘to be’ > Finnish *olla are obvious counter-examples.
I’ve seen very little later work on it at all, but at least Häkkinen’s proposal to invert it to the form *CuCə > Samoyedic *CoC does not seem correct to me (we even have the minimal pair *tuj ‘fire’ : *toj- ‘to come’). I believe a better improvement is to restrict it to *CoTə with front consonants, which deals with most of the counterexamples (e.g. *kojə ‘man’, *ojə- ‘to swim’, *kokə- ‘to go’, *toxə- ‘to bring’, *poŋə ‘bosom’, *soŋə- ‘to enter’).
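The difference between the broad rule and the restricted one can be sketched in a few lines. This is only an illustration under stated assumptions: which medial consonants count as "front" is not spelled out above, so the set `FRONT_CONSONANTS` below (labials and dentals/alveolars) is a guess, and the function names are made up for the example.

```python
# Vowel symbols used in the reconstructions quoted in this thread.
VOWELS = set("aeiouäöüëïə")

# Assumption: "front consonants" is taken to mean labials and
# dentals/alveolars; the comment above does not give an explicit list.
FRONT_CONSONANTS = set("pmwtsnlr")

# The counterexamples listed above, without asterisks or hyphens.
COUNTEREXAMPLES = ["kojə", "ojə", "kokə", "toxə", "poŋə", "soŋə"]


def has_shape_CoCe(stem: str) -> bool:
    """True for stems of the shape (C)oCə with one medial consonant."""
    core = stem[1:] if stem and stem[0] not in VOWELS else stem
    return (
        len(core) == 3
        and core[0] == "o"
        and core[1] not in VOWELS
        and core[2] == "ə"
    )


def apply_law(stem: str, medial_allowed=None) -> str:
    """Raise o to u in (C)oCə stems.

    With medial_allowed=None the rule is the broad *CoCə > *CuCə;
    with a set of consonants it only applies when the medial
    consonant belongs to that set (the restricted *CoTə version).
    """
    if not has_shape_CoCe(stem):
        return stem
    o_index = len(stem) - 3
    medial = stem[o_index + 1]
    if medial_allowed is not None and medial not in medial_allowed:
        return stem
    return stem[:o_index] + "u" + stem[o_index + 1:]


print(apply_law("porə", FRONT_CONSONANTS))  # → purə
for stem in COUNTEREXAMPLES:
    print(stem, apply_law(stem), apply_law(stem, FRONT_CONSONANTS))
```

Under these assumptions the broad rule wrongly raises every one of the listed counterexamples, while the restricted rule leaves all of them alone and still turns *porə- into *purə-.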
‘To be’ is a bit of a complex case but most reflexes would rather point to *walə-, with Finnic perhaps showing irregular shortening from *(v)oolë-.
“I’ve seen very little later work on it at all”
I know Aikio reconstructed *puri ‘to bite’ and *tuli ‘to come’ in his UED a-ć draft.
“we even have the minimal pair *tuj ‘fire’ : *toj- ‘to come’”
Even Janhunen’s SW (p.164) has “*toj- ~ *tuj- ‘kommen'”
“‘To be’ is a bit of a complex case but most reflexes would rather point to *walə-,”
Which reflexes? PMari *ŭla and Udmurt *ve̮l exclude PU †wali IMO.
A stem *tuj- for ‘to come’ seems to be fairly marginal, but yes, would seem to point to some amount of secondary *o > *u also within Samoyedic (also e.g. in Selkup-Kamass *tutə- ‘to chew’ seemingly < *soskə-).
*a-ə in ‘to be’ is primarily indicated by Khanty *ă, Mansi *aa, Hungarian a (as already per Sammallahti). I also derive Permic *vȯl- by a rule *waRV > *wȯRV, while *o-ə > *ȯ is usually found only in closed syllables (some very preliminary discussion in my Master’s thesis p. 82). Mari does not quite fit though and F/Mo/Ma could be indeed all better derived from *wolə-. Maybe a factor throwing things off is that this is in any case probably to be segmented as *wA-lə-, given e.g. Finnic 3PS *o-n or *o-mi, the adjective *(w)o-ma ‘own’ and Ob-Ugric *wă-s-, *waa-s- ‘to be, exist’ (cf. most recently Janhunen 2020).
As far as I can see, the regular reflexes of PU *woli- and *wali- would be identical in Ugric, so the Ugric reflexes do not support one reconstruction over the other.
As regards Samoyedic *o, the issue remains unresolved, but there are several details that seem to support the view that Samoyed *o developed secondarily from PU *u. First, there are cases where *o varies with *u. Sometimes there is variation between languages (as in Nganasan tuj ‘fire’ < *toj, as opposed to *tuj in the rest of Samoyed), sometimes between the simplex root and obscured derivatives (as in Selkup *tulǝj- ‘reach’, *tulčǝ- ‘reach’ vs. Samoyed *toj- ‘come’), and sometimes even paradigmatically (as in Selkup *nom : *numǝ- ‘god, heaven’). Second, there are cases of Samoyed *o corresponding to PU *u even in stems that do not have the structure *CuCi- in “Finno-Ugric”: e.g., Saami-Mordvin *čučki ~ Samoyed *čočǝ ‘pole, log’, Permic-Ugric *pučča- ~ Samoyed *počå- ‘soak, ooze’. One previously unnoticed example with internal irregular variation in Samoyed seems to be Tundra Nenets pudǝʔ- (< *pučǝ-t-) vs. Selkup *pōčǝ- (< *počǝ-) ‘unstitch, tear (at the seam)’; this must be related to PU *pučki- ‘burst’, about which I’ve written in Linguistica Uralica 1/2014, p. 11-.
Depending on what you mean by “certainty”, you might like the latest and most impressive attempt to find some more (though it’s only a conference presentation in Russian at this point). I’m not impressed by PU *č ~ PIE *r, but check out the others!
On your idea that the substrate went through an Indo-European language – the maple word looks like it contains an I-E suffix *-tro-.
Maybe! First part also looks like it might be from √h₂weg-s- ‘to grow’ (but the question is how would this add up to ‘maple’).
We can always speculate that something non-IE was folk-etymologized as PIE **h₂wékstrom ~ Proto-Indo-Iranian **wákštram “growth hormone”.
Though actually, I’m currently in a garden where maples and ashes are among the most common weeds. “Unwanted growth in a field” for “maple” is not inconceivable. The instrument suffix is wrong for that, but if it’s all folk etymology in the first place…
Agricultural substrates in Europe are a very intriguing topic!
However, the insights and theories pose, as a whole, some kind of a puzzle:
There are quite a number of unattested languages of agricultural populations from which sets of loanwords are suspected. These languages cannot be easily identified with one another.
On the other hand, genetic studies show that early farmers in Europe, with “early” extending to a stretch of ~3,000 years, go back to a very small original population migrating from Asia Minor to Greece around 6,000 BCE, with no major subsequent migrations along the same path. While not completely excluded, it would be hard to conceive that this population would have entertained more than, say, 4 unrelated languages (meaning “not confirmable as related” from the perspective of a fictive linguist observing these languages as of the time of migration ;-) ).
Given the genetic and archeological record, early farmers were highly reserved against the indigenous hunter-gatherer population, so that an agricultural society using a language of pre-agricultural Europe is highly unlikely before Funnel Beaker time.
So it is not surprising that pre-agricultural Europe was highly diverse.
But early agricultural Europe – not so much. This leaves linguists with another major task:
Finding out which substrate languages are actually the same.
Having said this, there are several attested languages that have to be integrated in this picture, and one group of these are the North-(West-)Caucasian languages. They are likely to be connected with early European farmers in some way; therefore, identifying an agricultural substrate in (parts of) Uralic is, a priori, quite plausible.
Perhaps I don’t completely understand the genetic findings, but I know that some exclusive genetic connections have been found between the modern North-West Caucasus and the Starčevo–Körös–Criş horizon of early agriculture.
As far as I see, several scholars conclude either that early Starčevo farmers and North Caucasian populations took separate paths from an original ancestral homeland in Anatolia, or that North Caucasians migrated to Anatolia first and then to Europe as farmers.
This seems to use the tacit and awkward, but somehow widespread axiom that, while migrations can be assumed for most prehistoric peoples without major restrictions, Caucasus populations are assumed to have stayed put all the time. However, the facts sketched so far allow another theory: Perhaps North Caucasian populations have actually migrated there from the middle-Neolithic Balkans!
This actually fits well with the theory of identifying the Proto-IE population with the Cucuteni–Trypillia culture (and later steppe cultures with IE dialect groups). In that case the somewhat enigmatic, but well-described suggested interaction between PIE and North-(West-)Caucasian may have taken place not between steppe and Caucasus mountains, but across the Northwestern edge of Transylvania.
A long story cut short, theories combining the words “agricultural substrate” and “North-West Caucasian” have a good chance of convincing me ;-)
The agricultural substrates in Germanic, Celtic, Italic and Greek are all very similar, though.
Interesting. Do you have a reference for this?
I ask because the Early European Farmers did not have any Caucasian Hunter-Gatherer = Iranian Neolithic ancestry; that came to Europe this side of Greece only as part of Steppe ancestry (which is almost 50% CHG, 50% Eastern Hunter-Gatherer).
Sergei Starostin beat you to it by over 30 years. :-) But it does not fit well with archaeogenetics.
… and Balto-Slavic and Albanian and possibly also Anatolian. I suppose you mean the “a” language with alternations between #aCC- and #CVC- (e.g. German Amsel, Latin merula).
The “a” language must have had some overlap with Linear Pottery or one of its later descendants (such as Funnel Beaker). Despite its seemingly pan-European spread, a strong presence in Central/Eastern Europe would be enough to account for the wide distribution, given probable IE expansion paths.
The “a” language cannot be easily reconciled with any attested language family. This leaves at least Basque and the North Caucasian families to account for (unless you put them down to WHG and CHG, respectively, which does not convince me). Etruscan-Lemnian, on the other hand, fits better into an Indo-Uralic areal type, so that these languages may go back to a population migrating in connection with IE expansion (as a satellite of IE migrations, pushed by them, or following their route?). More importantly, there are several substrate remainders found in European languages that don’t relate to the “a” language, at least not certainly so – for instance, the pre-Greek (“Pelasgian”/”Minoan”) substrate, as characterized by the “-inth” ending. Pelasgian and the “a” language have been identified with each other, but this hypothesis hinges on a single Greek word (“garlic”). I admit that’s one (grand total: 1) hint in favor of identity of these languages (language families), and none (0) against. It’s still a bit of a weak case, given how little is known about them, and taking into account that an ending can easily be copied from loan words to loan words from other sources.
My main point is: Europe has never been linguistically uniform at any point in attested history, and probably hasn’t been before.
And that is somewhat tricky to square with early farmers’ genetics.
A correction: Starčevo culture, not Starčevo–Körös–Criş.
Anna Szécsényi-Nagy: Molekulargenetische Untersuchungen zur Bevölkerungsgeschichte des Karpathenbeckens (2015)
(Quotation from German wikipedia.)
CHG doesn’t play a major role for *any* extant population today, not even in the Caucasus
(the Caucasus took heavy steppe inflow starting from the Copper Age).
Conversely, of course, searching for modern populations with the most CHG ancestry points you … drumroll … to the Caucasus ;-).
Would be nice to read … but Starostin really suggested a non-Caucasus homeland for North-Caucasians?
I think theories about substrates from unattested languages are so speculative that they can hardly be called science. And there is more uncertainty in how the spread of language families matches to cultures and genes than archeogeneticists want to admit.
Take for example the #aCC- and #CVC- ablaut. The Indo-Europeanist in me sees a generalized *R̥C > #aRC, RVC ablaut in some (maybe unattested) branch of Indo-European. Yet somehow this is “Pre-Indo-European”. Because it fits some narrative. ‘Pre-Greek’ is even worse. It has an s-mobile and a nasal infix and yet somehow is definitely “Not Indo-European”. People just see what they want to see in ‘substrate’. And that is a big problem.
The idea that the alternations we see in this kind of substrate vocabulary might be phonological and not morphological is very valid. On the other hand this doesn’t have to mean they’re specifically “ablaut” (an overly vague concept as soon as we are not talking about the actually attested IE languages — if not sooner). I’ve seen people naively thinking that alternations like Estonian tütar : tütre ‘daughter’ would somehow reflect some form of IE ablaut rather than simply native syncope *tüttäre- > *tüttre-.
Hence we could e.g. entertain the idea that “movable a-” results just from early vs. late loaning of words and initial vowel loss within some of the substrate varieties. The default hypothesis for explaining vowel : zero alternations is always syncope (with some secondary weight on vowel epenthesis in favorable positions); for that matter, vowel quality alternations probably should by default be assumed to arise from umlaut (with some secondary weight on stress conditioning).
In a still closer parallel to my Estonian example, actually even initial vowel loss within the IE branches themselves could be involved. This change is one possible routing for the development of what are normally reconstructed as initial *HC- clusters: early common IE *HC- > *əC-, with initial schwas being lost in most languages but maintained in Greek and Armenian. (Also, in some cases we could then also opt for reconstructing e.g. an “original” schwa of nonlaryngeal origin, though that’s not a necessary part of the idea.)
Good idea, though Greek must have maintained at least two of the laryngeals long enough for the colored schwas to get identified with different existing vowel phonemes. On the other hand, the fact that the #aCC ~ #CVC words have specifically a could be related to the fact that all three “West IE” branches (Celtic, Italic, Germanic) have word-initial *CHC- > *CaC-, now that you mention it – i.e. *CHC- > *CHəC- > *CəC-, followed by *ə > *a.
(…or perhaps the first two steps happened in all environments, not just word-initially, and were followed by deletion of *ə in Germanic except in the first syllable, and then by *ə > *a in Italo-Celtic and Germanic separately/areally. I should look into that sometime. If that works out, the first step could be dated back very far indeed, followed by *ə > *i in Indo-Iranian and the redistribution of the allophones of */ə/ in Greek… haven’t looked into Tocharian or Anatolian…)
Howl, deriving the “a” effect from IE ablaut is intriguing; I haven’t heard that theory before.
And yes, the suggested phonological development is plausible for a hypothetical IE language – in fact, Italic and Celtic come very close.
Just to be clear: Your explanation still requires an “a” language, because the different reflexes wouldn’t be regular in the respective branches (the blackbird/thrush word, e.g., would yield **ums- in Proto-Germanic and OHG). So basically you claim the “a” language is an unattested IE language.
Now this language would then have all of the words (nouns) in question (somewhere between 10 and 20 in total) in two variants, namely in full and zero grade –
but with little or no suffix variation and identical (or at least “indistinguishable”) semantics!
If zero and full grade of a root are both attested in one language in nouns, there is usually a clear and significant derivative relation between them. Exceptions to this rule exist, but these are usually abstract nouns (e.g. via different derivative paths from a verb), not with plants and animals. Even then some suffix variation would be expected.
The only way to save this would be to assume a root noun, and ablaut occurring in the declension paradigm. This actually fits the required small semantic load of the “a” variation rather well. Moreover, cases of IE branches keeping several alternative paradigmatic levellings of the same IE word are well-established.
However, most etymons in question don’t fit the required shape of a root noun – basically, they are too long (see the extra vowel slot in the blackbird’s -Vl- suffix).
But most importantly, how could we conclude that this language is IE if no word suggested for it can be proven to be of PIE origin?
I admit this is somehow a vicious circle: By definition, we can only rely on words with the “a” effect being contained in the “a” language. And these all show the “a” effect …
However, I’m not aware of any reflex of an “a” word in any IE branch that shows the usual phonological developments for that branch and _cannot_ be explained as a loan from the “a” language.
Such a finding would be an argument at least for that etymon to be of IE origin, and perhaps also the language. But this doesn’t seem to have surfaced. Strikingly, there is apparently no trace of the words in question in Indo-Iranian, Armenian, or Tocharian.
This reinforces the loan theory. And without any non-loans, it is hard to argue for a specific origin of the donor language …
So I’m afraid characterizing the “a” language as IE seems possible, but means even more daring speculation. On the positive side, the ablaut variation may be a very good model to motivate these alternations with minimal semantic variation.
Blasius, this isn’t a forum that uses BBCode, it’s a blog, and as such uses HTML directly: <blockquote>, <i>, <b>. :-)
Here’s Starostin (1988) on IE/North Caucasian isoglosses, plus speculations that this contact happened in the Balkan/Pannonia area; PDF in typewritten Russian with diacritics added by hand.
Greek has a big fat IE substrate (that it shares with Italic, among probably others); Beekes was quite wrong to lump it with the Minoan-like non-IE substrate that Greek also has. The very interesting paper is here.
That and the -(i)nth ~ -(i)th ~ -(i)d language seem to have been identical not only because of the “garlic” word, but also because both extend across Greek, Slavic, Germanic, Celtic and Italic.
Incidentally, the a- has been compared to the West Caucasian 3sg possessive prefix a-. Moreover, I’ve been told that the ax, Gothic aqizi, pre-Grimm projection *agʷesi(-), makes perfect sense as a West Caucasian *a-gᶣasᶣə “3sg’s ax” (keep in mind that the vowels would come out as [œ] and [y] or thereabouts, so /e/ & /i/ would be the closest equivalents). It’s also not far away from the Basque definite article (-a, earlier -ha, IIRC reconstructible as a demonstrative pronoun *har, so it hasn’t been a suffix “since ever”).
We know too little about the “a” language to identify it with anything – given time and space, it must have been a pretty distant relative of anything known anyway, and distant relations with anything are also difficult to reject.
By the time IE showed up and the Corded Ware culture formed, the languages of the Early European Farmers had been diversifying for four thousand years. Even if the Anatolian Farmers entered Europe a single time, speaking a single language, that should mean the Steppe invaders encountered a quite diverse language family.
I don’t think Basque is WHG. I think it’s EEF. (If anything is WHG, that’s the substrate in Saami.) But given that even internal reconstruction of Basque with the help of the many Latin loanwords only gets us two thousand years back, that leaves another three thousand between it and the EEF languages that were in contact with the Corded Ware culture much farther northeast. In short, I think Vennemann’s approach of trying to find almost modern Basque in toponyms all over Europe is largely doomed even though I think Basque is the closest living relative of the languages spoken in those places immediately before IE.
CHG ancestry is about equally associated with way too many different language families, from Kartvelian to Elamite, not to mention IE, for me to identify it with any one of them.
Last I checked, the people and the cattle of Tuscany genetically came from Anatolia after all, so Rhaetic-Etruscan-Lemnian, which actually has evidence of an Italic substrate, may well have come from a non-IE-speaking corner of Anatolia (the northwest seems to be the only option) and settled in Italy pretty much as Herodotus claimed. I suspect it’s actually the closest known relative of IE (closer than Uralic), but of course that needs to be tested by a lot more research, to the extent that the short inscriptions allow that. (…Anyone have Emperor Claudius’s 12 books on the Etruscans and their language? …no?)
2015 is a bit too early to be useful – the useful data were mostly only published that year, some even later.
Near the end of the conclusion, he says the localization of these contacts will still require more work … ;-)
I’m not aware of -inth-words outside of Greek.
And given the meager set of data, being skeptical seems appropriate.
You acknowledge that Greek shows traces of two different substrates;
so there is nothing in principle to be said against two substrates (“a” and “inth”) in the other European IE branches, and three in Greek.
Nothing, that is, except one word.
“His/her blackbird” requires a bit more explanation, though ;-)
Btw, in Nakh languages, absolutive and ergative of personal pronouns stand quite exactly in the alternation supposed here. Again, this is a poor comparison as the ergative of “pea” is not particularly frequently used, and there is no hint that something similar once held for nouns proper.
That’s right. That is one point I usually neglect.
For our daily dose of nitpicking, there is no way to tell whether a supposed substrate language came – originally – from a WHG or EHG population. WHG and EHG mixed all over Europe, particularly in Scandinavia. The only way to be sure whether a particular language came from the East or West would be to prove a substrate from a very similar language further East or West.
We should definitely take into account the possibility that no descendant of a CHG language has ever been attested, just as with WHG.
Really? I don’t know about genetic findings of Etruscans.
Is there really a significant genetic difference between Etruscans, Romans and Sabellians?
As a reminder, we shouldn’t overstress genetics. Genetics alone won’t tell us the origin of a language – unless people have stayed isolated for an extended time.
And again, I plead for more skepticism. Raetian is so poorly attested that it can’t be “decoded”, let alone clearly linked to Etruscan. I think some optimism in that direction is well-founded, but we don’t have a lot of evidence in that matter.
Did you notice you may run into some trouble here with your Out-of-Anatolia hypothesis?
Unless you suppose emigration of Tyrsenians to Anatolia along with IE Anatolians.
Or do you suggest an Anatolian homeland for PIE _and_ Uralic … ? ;-)
And I do hope the html tags now work …
Oh, I must have botched the HTML. Here it is.
Well, not specifically with [tʰ]. The idea is that the original sound was mostly heard as /dʰ/ by IE speakers before Greek devoiced it.
(And if *d and *dʰ weren’t modally and breathy-voiced but stiff- and slack-voiced, for example, then it’s not necessary that the language had specifically breathy-voiced consonants.)
Yes, sorry: the trick is that 3sg possessive affixes often moonlight as definite articles or suchlike. Turkish comes to mind.
Yes. Genetically there was a continuum WHG – Scandinavian HG – Baltic HG – EHG, so obviously Scandinavian HG people are the most likely to have spoken something ancestral to that substrate based on geography alone; SHG often seems to be classified as part of WHG, though.
Sure. It’s just statistically less likely because so many language families are attested in that general region.
Me neither – this is about people (and cattle) that live in Tuscany today.
Raetian is attested in a large number of short votive inscriptions (Wikipedia: “around 280 texts dated from the 5th up until the 1st century BC”) that are so stereotyped they seem to be understood pretty well. I’ve never encountered any doubt that it’s related to Etruscan. (Wikipedia again: “most scholars now think that Rhaetic is closely related to Etruscan within the Tyrrhenian grouping […] Common features between Etruscan, Rhaetic, and Lemnian have been observed in morphology, phonology, and syntax. On the other hand, few lexical correspondences are documented, at least partly due to the scanty number of Rhaetic and Lemnian texts and possibly to the early date at which the languages split.” – accompanied by three references and a link to the “Tyrsenian languages” article, which presents a handful of cognates.)
BTW, other votive inscriptions from the same region and the same time are in a variety of IE languages that are vaguely similar to Venetic, Celtic and Italic and are not attested otherwise.
Given that the few sequenced people that were buried by Hittites lack Steppe ancestry altogether, an Indo-Anatolian homeland south of the Caucasus should be taken seriously as a possibility. (That would also explain a few other things.) Proto-Indo-Tocharian and Proto-Indo-Actually-European were spoken well north of the Caucasus, but that doesn’t mean Proto-Indo-Anatolian was spoken in the same spot.
If that works out, a Proto-Indo-Tyrsenian homeland somewhere around eastern Anatolia wouldn’t be surprising. Proto-Indo-Uralic I’d place north of the Caucasus again, but who knows.
Oh, forget about the genetics. Actual Etruscans have now been sequenced, and the results are here; the immigration from the east was real, but it only happened during imperial times and can’t tell us anything about the origin or history of the Tyrsenian language family whatsoever.
OK, I checked Kroonen’s original paper again – next to “garlic”, it is also “pea”.
Two witnesses are a lot stronger than one, so I admit identity (relatedness) of the substrates is the more likely hypothesis.