Postscript: A note on Gumuz stem structure

I just noted in the previous post that some internal reconstruction of the structure of Koman roots might be a good idea, e.g. for reducing the large stop inventory of maximally five series /p pʰ pʼ b ɓ/ (broadly retained in Uduk, Dana and Opo; aspirates lost in Komo, both aspirates and implosives lost in Gwama). However, considering for a moment also the Gumuz side, another noticeable difference is not even the diversity of stops but their frequency. Proto-Koman has no shortage of stop-final roots, e.g. the smallish set of etymologies found in all five languages includes words like *bɪncʼ ‘fishhook’, *dak- ‘to finish’, *(h)ɔpʼ- ‘to sip’, *kʼʊ́p ‘head’, *kʼu₂t̪ʼ- ‘to cough’, *pʰàd̪- ‘to fly’, *sʼi₂´k ‘rat’, *sɪ`t̪ʼ- ‘to be far’, *ʃukʼ- ‘to wake someone’, *t̪ʰáɓ- ‘to kick’, *t̪ɪ´t̪- ‘to roughen stone’, *úpʰ- ‘to bathe’.

This does not look to be the case in Gumuz. While we do not have a proper proto-Gumuz reconstruction to consider in full, Ahland’s grammar presents a list of 52 narrow Gumuz verb stems in appendix C (not a complete list of even the data appearing in the grammar, but an informative initial corpus anyway). Rather few of these end in voiceless stops or affricates: only √(ɓ)átʃ- ‘to hit, kick’, √faat- ‘to fall’, √ook- ‘to heat’. If following my hypothesis that *p > f, there’d be also √ʔa/ef- ‘to wash’. A decent number end in -b- (√ɗáb- ‘to find’, √ɗamb- ‘to try, taste’, √tab- ‘to be thick’, √tib- ‘to kick’) and a few in a voiced velar (√dugw- ‘to run (liquid)’, √fâg- ‘to urinate’), contrasting with a complete absence of -d- though. Among the implosives we have the inverse: no cases of -ɓ-, but three of -ɗ- (√káɗ- ‘to finish, run out’, √koɗ- ‘to skin, strip’, NG √wíɗ- ~ SG √jír- ‘to see/check’). I suspect that the situation results from a pre-Gumuz loss of many stem-final stops plus new stem-final consonants fossilizing from morphology, yielding an uneven distribution of segments.
I don’t currently want to take a stance on how precisely this would have happened though — plausibly this could be cluster simplification *C₁C₂ > *C₂, or medial lenitions, or loss in absolute word-final positions, or even some of each.
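The skew described above can be tallied directly from the stems quoted in this post. What follows is a minimal Python sketch, not a count over Ahland’s full list of 52: it covers only the dozen stems cited here, and the helper name `final_consonant`, as well as the decision to treat -gw- as a labialized velar and -tʃ- as one segment, are assumptions made for illustration.

```python
from collections import Counter

# Narrow Gumuz verb stems as cited in the text (a subset of Ahland's
# appendix C, not the full list of 52).
STEMS = [
    "ɓátʃ-", "faat-", "ook-",          # voiceless stop / affricate finals
    "ɗáb-", "ɗamb-", "tab-", "tib-",   # -b-
    "dugw-", "fâg-",                   # voiced velar
    "káɗ-", "koɗ-", "wíɗ-",            # -ɗ-
]

def final_consonant(stem: str) -> str:
    """Return the stem-final consonant, ignoring labialization."""
    s = stem.rstrip("-")
    if s.endswith("w") and s[-2] in "gk":  # treat gw/kw as a labialized velar
        s = s[:-1]
    if s.endswith("tʃ"):                    # the affricate counts as one segment
        return "tʃ"
    return s[-1]

counts = Counter(final_consonant(s) for s in STEMS)
for seg in ["b", "d", "g", "ɓ", "ɗ"]:
    print(seg, counts[seg])
```

Even on this small sample the asymmetry is visible: -b- and -ɗ- are well represented, while -d- and -ɓ- do not occur at all.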

For the -b- and -ɗ- groups here I do not have firm proposals (very speculatively, maybe ‘to finish’ and ‘to check’ could contain ɗá- ‘to go’?); but good candidates for this process are provided by the system of incorporated body part terms. They seem to have grammaticalized into a wide range of functions, ranging from formation of locative prepositions / prepositional phrases, to a system of noun classifier agreement on verbs, and the discussion of all of these takes up a large proportion of Ahland’s grammar. The most general-purpose among these would seem to be -kʼwá ‘head > top’; -cá ‘eyes > front’; -sa ‘mouth > opening’; -ʃa ‘hip > base, bottom’; -tsa ‘*body > instrumental’ [1], all with also -aC allomorphs, such that in their grammaticalized functions it is essentially only the consonant (maybe also a floating tone for ‘head’ and ‘eye’) that really carries the meaning. Regular “object-incorporated” verbs maintain these as independent moving parts of the verb construction, such that person marking will follow the stem but precede the “object”, e.g. ʃá- ‘die’ → ʃá-kʼw- ‘kill’ with a 2PL imperative ʃá-cá-kʼw ‘kill them!’. However, also the majority of the basic verb stems in Ahland’s appendix that end in the consonants /kʼ/, /s/ and especially /ʃ/ show semantics compatible with the corresponding incorporated object suffixes:

  • korakʼ- ‘to peel’ (= ‘to peel the top, the surface’), evidently derived from √koɗ- ‘to skin, strip’, given that r is just the medial allophone of /ɗ/.
  • (√takʼ- ‘to spit’; I see no grounds to posit division as **ta-kʼ-)
  • gis- ‘to grill’ (= ‘to cook for eating’?)
  • cʼeʃ- ‘to cut’ (= ‘to cut down, thru’?)
  • gaaʃ- (< *garʃ-) ‘to grind’ (= ‘grind down’? as I’ve already hypothesized in the previous post)
  • kʼoʃ- ‘to penetrate’ (= ‘pierce to the bottom, thru’?)
  • ńʃ- (SG also ŋáʃ-) ‘to soak’ (= ‘to become wet to the bottom, thru’?), and perhaps akin to √ŋar- ‘to take, bring’

…so there seems to be good odds that at least some of these derive from shorter verb roots with an incorporated object marker fossilized to the end of the stem. In turn, any *CV roots uncovered this way could prove to correspond to *CVT or similar roots in Koman. So e.g. considering again my previously listed comparison of Koman *kʼi´sʼ- ~ (South) Gumuz cʼeʃ- ‘to cut’, if the latter is rather reduced from pre-Gumuz *kʼʲeC-ʃ-, this would then tell us nothing about what is the default Gumuz equivalent of Koman *sʼ.
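The formal side of the segmentation applied in the list above can be sketched as follows. This is a hedged illustration in Python: the table `INCORPORATED` and the function `candidate_split` are hypothetical names, the glosses are the ones given above, and a formal match only flags a candidate. √takʼ- matches formally too, even though no division is argued for it, so the semantics still has to be judged by hand.

```python
# Consonantal cores of the incorporated body-part terms listed above,
# with their grammaticalized senses. The ejective mark is U+02BC throughout.
INCORPORATED = {
    "kʼ": "head > top",
    "c": "eyes > front",
    "s": "mouth > opening",
    "ʃ": "hip > base, bottom",
    "ts": "body > instrumental",
}

def candidate_split(stem: str):
    """Return (root, suffix, gloss) if the stem ends in a consonant that is
    identical to an incorporated term, else None. Purely formal."""
    s = stem.rstrip("-")
    if s.endswith("tʃ"):  # an affricate, not t + ʃ
        return None
    # Try longer suffixes first so that -ts is not misread as -s.
    for suffix in sorted(INCORPORATED, key=len, reverse=True):
        if s.endswith(suffix) and len(s) > len(suffix):
            return s[: -len(suffix)], suffix, INCORPORATED[suffix]
    return None

for stem in ["korakʼ-", "takʼ-", "gis-", "cʼeʃ-", "gaaʃ-", "kʼoʃ-", "ɗáb-"]:
    print(stem, candidate_split(stem))
```

Any shorter root produced this way (e.g. a hypothetical cʼe- from cʼeʃ-) is only a starting point for comparison, not a reconstruction.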

If per hypothesis all sorts of pre-Gumuz root shapes like *CVp-, *CVt-, *CVsʼ- could end up reduced to just *CV- (this also would be in principle explorable even without insisting on the relationship with Koman, e.g. thru loanword studies), obviously this also increases the odds of chance correspondences. At most this could be slightly mitigated if loss of stem-final consonants maybe proves to have left some effects on tone; or perhaps yields some instances of long vowels; or has left some morphological doublets around. But it’s also a good reminder that my comparisons remain very tentative and we should keep also waiting for fuller Gumuz data.

[1] Not actually attested as an independent noun, and Ahland’s argument for treating it as specifically a grammaticalized body part rather than in origin a more abstract noun is not very clear to me (is there an assumption here that the other incorporated objects being body parts suggests that this must’ve been too?). I would propose in the first place instead ‘self’, especially in light of examples like ka-tsá-má ‘by him/herself’ (ka- comitative ‘with’, -má 3SG possessive) or fáɗ- ‘to rise’ → fád-(á)ts- ‘to get up’ (‘to rise by oneself’). The freestanding word for ‘body’ is the clearly unrelated ɓaga (also ‘person’).


Komuz sound correspondences

Another Africanist sideproject I have around, and have had for a while: bits of further development on the Komuz, i.e. Koman–Gumuz hypothesis. Given newer ongoing documentation, the relationship looks fairly clear to me, and their removal from the Nilo-Saharan macro-hypothesis should not lead to also abandoning their relationship with one another. (I’ve not done any digging into early versions of the idea though, possibly the data and/or argumentation there had still been too sketchy to suffice to convince people.)

My data in this post comes on the Koman side from Otero 2019, A Historical Reconstruction of the Koman Language Family (PhD thesis); on the Gumuz side from Ahland & Kelly 2014, Daatsʼiin-Gumuz Comparative Wordlist (manuscript), reconstructions mine. You may recall a few shorter comments of mine on both already some years ago on Tumblr ‹1›, ‹2›. Some supplemental Gumuz information and lexica have been added from Ahland 2012, A Grammar of Northern and Southern Gumuz (PhD thesis). The extinct Gule (possibly para-Koman; could be also a third Komuz branch entirely) I’ve left outside consideration so far. Presumably also considering more of the known narrow Gumuz data without Daatsʼiin cognates would help. Regardless, basic vocabulary from these sources already yields ca. 50 candidates for Komuz cognates, and this already seems to allow identifying several nontrivial sound correspondences as very tentatively regular, with 2–3 examples each. My guess is that at least a decent 100–150 Komuz etymologies could be within reach with fuller lexical documentation.

For a very short debriefing, the outline of the two families is as follows:

  • Koman: five small languages in three branches (Gwama, Komo–Uduk, Dana–Opo), relatively diverse for its size. Otero documents two slightly differing dialects of both Gwama and Uduk, four of Opo. Further detail could be feasible.
  • Gumuz: comprises a “narrow Gumuz” dialect continuum with on the order of 200,000 speakers and one just slightly more divergent member, Daatsʼiin. Ahland primarily documents two narrow Gumuz varieties, and claims to have data up her sleeve on at least eight (with reference to a SIL field report and her Master’s thesis, both unpublished), but for today’s purposes the two varieties “Southern” and “Northern” will have to do.

The Gumuz Fricative Chainshift

Both Koman and Gumuz share the basic voiceless fricatives /s ʃ h/ (Gumuz also has /f/). Northern Gumuz diverges slightly by showing /χ/ for /h/. However, correspondences between these seem to be most often “off by one”, and it seems this can be mostly interpreted as being innovative on the Gumuz side.

Koman *h ~ Gumuz zero

  • K *haɗ- ‘he’ ~ G *ára ‘3SG pronoun’ (> D jáárʕám, SG áŋa, NG áχó)
  • K *ha- ‘to come’ ~ G *wé- id. (> D SG NG wé-)

Here I would suggest Proto-Komuz zero rather than *h. Otero does not seem to reconstruct Proto-Koman word roots beginning with *a-, while clear examples of PK *h- occur mostly before *a (the others are *haɓ ‘she’, *hag(a)- ‘to have sex’, *hasʼ- ‘to trample’). In *hàn ~ *hɪ`n ‘it’, *hʊ̄n ‘them’, *h- might be analogical from the rest of the 3rd person pronouns (but probably already in PK). [1]

Koman *s ~ Gumuz *H
Given *H > /χ/ in NGumuz, quite plausibly this *H was not [h] but e.g. [x], and debuccalization in Daatsʼiin and SGumuz has been in parallel.

  • K *bàs ‘blood’ ~ G *maHá id. (> D maha, SG mahá, NG maχá)
  • K *sɪ`t̪ʼ ‘far’ ~ G *Hát- id. (> D hááti´, SG háat, NG χát)

For the correspondence *b ~ *m in the first example, cf. also K *d̪i`bà or *d̪ɪ₂`bà ‘rain’ (Dana–Opo – competing with a better-spread PK *ʃɔkʼ) ~ G *dama id. We could tentatively notate this as Proto-Komuz *mb, although it could also stand even for plain *m, since so far I have not found any examples of a prevocalic *m ~ *m correspondence. [2]

Koman *ʃ ~ Gumuz *s

  • K *ʃa- ‘to eat’ ~ G *sá- id. (> D sa-, SG NG -)
  • K *ʃɔkʼɛn ‘louse’ ~ G *sakuna id. (> D sankun, NS sákúna; with an echo nasal in Daatsʼiin?)
  • K ? *ʃutʼ ‘rope’ ~ G *si´a id. (> D si´, SG NG si´á)

In ‘rope’ we do find front vowels also in Koman: /i/ in Uduk ʃi´, /ɪ/ in Lowland Gwama ʃwɪ¯tʼi`n. Instead of any kind of irregular palatalization due to *ʃ-, perhaps the PK should be reconstructed as *ʃwɪtʼ instead (and surely with an -ATR close vowel; *ʃu- would predict Gwama s-).

No PK **ʒ exists that could be compared with Gumuz *z (> D SG NG z; e.g. D gaz, SG gááza, NG gáánza ‘old’), and Otero notes the existence of PK *z to be questionable too, with only one cognate set with good attestation — which also means ‘chili pepper’ and is thus obviously a Wanderwort of at most some 500 years of age. (Proto-Koman is definitely much older, Otero mentions that glottochronological studies have suggested circa 5000 years; my own spitball approximation would be 2000 years at minimum.)

A further, seemingly “off by two” correspondence is Koman *cʼ ~ Gumuz *tsʼ:

  • K *cʼɛ̄ ‘ear’ ~ G *tsʼe id. (> D tsʼê, SG NG tsʼéa)
  • K *cʼumcʼum- ‘to suck’ ~ G *tsʼim- id. (> SG tsʼim-úkw-, NG tsʼiim; ?? D asám-tsa, ásám-kʼó)

but this might actually be only off by one: what Otero reconstructs as palatal stops *c, *ɟ, *cʼ are reflected in most Koman varieties as sibilant affricates or fricatives (e.g. Gwama & Komo sʼɛ̄, Yabus Uduk ʃʼɛ́, Opo tʃʼɛ̀ ‘ear’; PK *cwálá > Highland Gwama swála, Lowland Gw. swája, Komo , Yabus Uduk ʃwá, Opo tʃá ‘tree’). The palatals c, ɟ are attested only in Dana and Uduk (c cʼ just in its Chali dialect) in the more northwestern parts of the Koman area, and while this looks like an archaism if considered purely within Koman, in an areal perspective it could represent secondary assimilation to the phonetics of neighboring Nilotic languages. Thus, shibilant affricates *tʃ, *dʒ, *tʃʼ also seem like a possible reconstruction for Proto-Koman, and for the last hence also Proto-Komuz. *tʃʼ > *tsʼ in Gumuz would then exactly parallel the shift *ʃ > *s among the fricatives.
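Taken together, the Gumuz-side changes proposed in this section form a small chain that has to apply in one step, since otherwise *ʃ > *s would feed *s > *H. A minimal Python sketch, assuming pre-segmented forms and the shibilant reading *tʃʼ for Otero’s *cʼ; `GUMUZ_CHAIN` and `gumuz_reflex` are illustrative names, and vowels and tones are left untouched.

```python
# Proposed Gumuz innovations, keyed by the older (Koman-like) value.
# Koman *h ~ Gumuz zero is left out: the text attributes that one to
# h-epenthesis on the Koman side rather than to loss in Gumuz.
GUMUZ_CHAIN = {
    "s": "H",      # *s > *H (perhaps [x])
    "ʃ": "s",      # *ʃ > *s
    "tʃʼ": "tsʼ",  # *tʃʼ > *tsʼ, parallel to *ʃ > *s
}

def gumuz_reflex(segments):
    """Apply the chain to a list of segments in a single pass, so that the
    output of *ʃ > *s is not shifted on to *H."""
    return [GUMUZ_CHAIN.get(seg, seg) for seg in segments]

print(gumuz_reflex(["ʃ", "a"]))       # cf. K *ʃa- ~ G *sá- 'to eat'
print(gumuz_reflex(["b", "a", "s"]))  # cf. K *bàs ~ G *maHá 'blood'
print(gumuz_reflex(["tʃʼ", "ɛ"]))     # cf. K *cʼɛ̄ ~ G *tsʼe 'ear'
```

Only the sibilant in each form is expected to match the Gumuz side; other differences (such as *b ~ *m in ‘blood’) are separate correspondences.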

So far I do not have Gumuz cognates for Koman *ɟ (also itself slightly dubious, though not as much as *z [3]), and only one for Koman *c, with Gumuz *s:

  • K *cikʼa- or *cɪkʼa- ‘to listen’ ~ G *gá-s-akʼʷ- ‘to hear’ (> D gásakʼo, SG gásakʼw-, NG gésakʼw-).

The comparison only works if *gá- can be parsed as a prefix of some function (e.g. reduced from *gam- ‘to know’? but whence the high tone then?). The Gumuz forms also contain incorporated *-akʼʷ- ‘head’, but this is not necessarily a problem for the comparison: there’s slight evidence of a fossilized verb suffix *-kʼ in Koman, too. PK *was → Komo wáʃ-i´kʼ ‘to boil’ might be originally ‘boil to the top’; PK *gɔ̀ɗɔ- > Dana kɔ̀ɗɔ-kʼ ‘to be deep’ might be originally ‘to be deep to the end’; where ‘to the top, end’ could be then grammaticalized from ‘head’, which is also one of the functions of this suffix in Gumuz.

I’ve collected also a number of potential but one-off Koman–Gumuz sibilant correspondences that diverge from the outline of this chainshift. Already in the absence of parallels, the possibility of false etymologies remains high, and most have in fact also further irregularities.

  • Gwama bɪ¯sʼàn ‘star’ ~ G *biiʒa id. (> SG bii´ʒa, NG biiʒa)
  • Gwama ʃɪ´ʃ- ‘to extinguish’ and sʼi´- ‘to die’ ~ G *ʃá- ‘to die, extinguish’ (> SG NG ʃá-). Working out any reconstruction here would require teasing apart fossilized morphological complexity in Gwama. It’s also unclear if this would have had PK *s or *ʃ: both yield Gwama ʃ before ɪ, s before i.
  • K *(j)Es ‘body’ (> Gwama jɪ¯s, Komo–Uduk *ɪ¯s, Dana–Opo *ɛ̀s) ~ G *-tsa id. (> SG NG -(a)ts(a), verb incorporated form)
  • K *kɛ̀s(ɛ̀)- ‘to fry’ ~ G *gis- id. (> SG NG gis-) — unclear initial and vowel correspondences, I’d definitely expect at least palatalization to **gʲis- > ɟis- in Gumuz.
  • K *Ki´sʼ- ‘to cut’ (*kʼi´sʼ- > Uduk, *kʰi´sʼ- > Dana–Opo) ~ G *kʼʲeʃ- id. (> SG cʼeʃ-) — unclear initial and vowel correspondences.
  • K *kʰɪ`s ‘new’ ~ G *kʲikʲa id. (> D jáá-cici, SG NG cicá) — assimilation *kʲ…s > *kʲ…kʲ in Gumuz?
  • K *ʃɔŋk ‘foot, leg’ ~ G *tʃogʷa id. (> D tʃugw, SG NG tʃogwa) — unclear medial correspondence.
    Alternately the Gumuz could be compared with Gwama zūgū ‘to stand’, perhaps < PK *ɟugu-. Cf. that in Gumuz the verbs for ‘to stand’ incorporate ‘foot’: D i´↓ká-tʃugwa-, SG NG i´i-tʃogw-.
  • K *ʃʊj ‘bone’ ~ G *zʷakʷa id. (> D voko, SG NG ʒákwá) — poor medial correspondence.
  • K *t̪i¯n ‘tail’ ~ G *tsiN id. (> D tsʼi´ŋtsʼi´ŋ, SG tsia, NG tsi´a) — maybe just secondary assibilation of *t before *i. Original medial perhaps *ŋ, merging into *n in Koman and lost in narrow Gumuz under palatalization? and I have no explanation for the ejective in Daatsʼiin.
  • K *t̪ʼwā ‘mouth’ ~ G *sa id. (> SG NG sa) — or maybe the Gumuz words rather to be related to *sá- ‘to eat’. (Daatsʼiin hos looks unrelated to any of these.)
  • K *wɔ̀ʃ ‘stone, rock’ ~ G *giʃa id. (> D giʃa, SG NG gi´ʃá) — poor vowel and initial correspondences.

Vowel reduction in Gumuz

Proto-Koman had clearly at least seven vowels: open/mid *a *ɛ *ɔ, -ATR close *ɪ *ʊ and +ATR close *i *u. [4] Otero identifies also a handful of mixed correspondences between *ɪ/*i, *ʊ/*u that he notates *ɪ₂, *ʊ₂, *u₂. Gumuz has just *a *e *i *o *u but also more rarely vowel length. The trivial vowel correspondences *a ~ *a, *i ~ *i, *u ~ *u between the two families seem clear. For the mid vowels there’s examples of not just the trivial *ɛ ~ *e and *ɔ ~ *o, but also mixed correspondences with *a on either side, maybe due to various assimilations (e.g. *Cʷa > Co is common even within Gumuz). However, for the -ATR close vowels, a clear correspondence seems to be Koman *ɪ ~ Gumuz *a. Probably this represents just centralization to *ə and, indeed, Ahland describes Gumuz short /a/ being qualitatively [ə]. Four examples have appeared above: *sɪ`t̪ʼ ~ *Hát- ‘far’, *d̪ɪ₂`bà ~ *dama ‘rain’, *cɪkʼa- ~ *gásakʼʷ- ‘to listen’, ʃɪ´ʃ- ~ *ʃá- ‘to die, extinguish’. Two further good examples are:

  • K *kʰɪ´- ‘to give’ ~ G *kʲá- id. (> D kjá-, SG NG cá-)
  • K *pʰɪ´- ‘to drink’ ~ G *fa- id. (> D fa-, SG NG fá-)
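The *ɪ ~ *a correspondence can be checked mechanically against the six pairs cited so far. A rough Python sketch: `first_vowel` is a hypothetical helper that decomposes each form and returns its first vowel symbol, skipping tone marks; the Gumuz form for ‘to listen’ is entered without the *gá- element, following the parse suggested earlier, and the Koman-side form for ‘to die, extinguish’ is the Gwama one.

```python
import unicodedata

VOWELS = set("aeiouɛɔɪʊ")

def first_vowel(form: str) -> str:
    """Return the first vowel letter of a form, ignoring tone diacritics."""
    # NFD splits e.g. "á" into "a" plus a combining accent, so the bare
    # vowel letter can be matched directly.
    for ch in unicodedata.normalize("NFD", form):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel found in {form!r}")

# (gloss, Koman-side form, Gumuz form) as cited in the text.
PAIRS = [
    ("far", "sɪ`t̪ʼ", "Hát-"),
    ("rain", "d̪ɪ₂`bà", "dama"),
    ("to listen", "cɪkʼa-", "sakʼʷ-"),
    ("to die, extinguish", "ʃɪ´ʃ-", "ʃá-"),
    ("to give", "kʰɪ´-", "kʲá-"),
    ("to drink", "pʰɪ´-", "fa-"),
]

correspondences = [(first_vowel(k), first_vowel(g)) for _, k, g in PAIRS]
print(correspondences)
```

All six pairs come out as ɪ ~ a; the check says nothing about whether the etymologies themselves hold.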

In Koman *ji`ɗV́ ‘water’ (three different 2nd-syllable vowels appear: Gwama i`jáʔ, Komo–Uduk *ji`ɗɛ́, Dana–Opo *ji`ʔi´) ~ Gumuz *aʄa id. (> D áʔéé, SG NG aja), possibly there has been an assimilation *jɪ- > *ji-. Original *ɪ- might be preserved in Gwama ɪ`ʃɪ` ‘wet’ besides Komo jɛ̀ʃ, Uduk jɛ̀s, for which some morphologically obscure relationship with ‘water’ seems likely. I could also suggest that, much like *kɪ > *kʲa- in ‘to give’ and, I would think, (? *ʃɪ- >) *sɪ- > *sʲa- > *ʃa- in ‘to extinguish’, Proto-Komuz *ɪ triggers here palatalization *ɗ > *ʄ, which then results in the correspondence Daatsʼiin ʔ ~ narrow Gumuz j. But I have no exact parallels for this, so this whole Proto-Gumuz reconstruction remains speculative (the closest would seem to be ? *raʄii or just *raʔii > D ja-raʔii, SG ŋii´, NG χii´ ‘black’). And another option could be to align these cognates differently: Proto-Komuz *jɪ, with a suffix *-ɗV in Koman and a prefix *a- in Gumuz.

For Koman *ʊ I don’t have many candidates for Gumuz cognates, but what little there is also points to Gumuz *a as its counterpart, and now with labialization rather than palatalization appearing on nearby consonants. Cf. above *ʃʊj ~ *zʷakʷa ‘bone’, as well as K *kʼʊ́p ‘head’ ~ G *kʼʷá id. (> SG i´l-kʼwá, NG li´-kʼwá), i.e. < Proto-Komuz *ʒʊC, *kʼʊC? Alas, both comparisons are uncertain due to problematic correspondences of medial / final consonants.

There is also some weaker evidence for a correspondence Koman *ɪ ~ Gumuz *i: cf. above bɪ¯sʼàn ~ *biiʒa ‘star’, *kʰɪ`s ~ *kʲikʲa ‘new’, and perhaps Opo tʼɪ´rá ‘rope’ ~ G *ti´rágʲá ‘root’ (> D ti´rági´, SG ti´ŋáɟa, NG táχáɟá). The first one is only attested from Gwama though, where ɪ ʊ sometimes appear also in correspondence to i u in the rest of Koman (= Otero’s *ɪ₂ *u₂; surely the first should be rather called *i₂?), and this could be then such a case, where Proto-Komuz and perhaps even Proto-Koman rather had *i. I’m not highly convinced that the other two are even correct, and also all sorts of conditional excuses could be contemplated (*ir > *ɪr in Koman, *CʲɪCʲ > *CiC in Gumuz?). A clear case of *i₂ corresponding with *i is at least K *ɓi₂d̪a ‘neck’ ~ G *ɓia id. (> D ɓi, SG NG ɓia).

A couple examples have yet other mixed correspondences, such as Koman *ɛ, *ɔ ~ Gumuz *i, *u (*kɛ̀s(ɛ̀)- ~ *gis- ‘to fry’ above; K *kɔ́d̪ ‘breast’ ~ G *kúá id. > D , SG NG kúá), but all of these are so far too weak to treat as regular.

Phonation correspondences

The Koman and Gumuz languages are united by a robust presence of ejective stops, which clearly reconstruct at least to both groups’ respective proto-languages. This appears to be unique among the putatively “Nilo-Saharan” language groups (though certainly also areal: ejectives abound also in Cushitic and all of the putatively Omotic language groups). Going by the PHOIBLE data, the only other appearances of ejectives in the NS area are isolated: Olu’bo in Central Sudanic; Ik in Kuliak; and the complete isolate Berta. But the Koman and Gumuz ejectives probably do go back already to Proto-Komuz. I’ve covered *cʼ or *tʃʼ above (‘ear’, ‘suck’), and we can establish elementary regularity also for *kʼ:

  • K *kʼaw ~ *kʼwa ‘dog’ ~ G *kʼawa id. (> D kʼaw, SG NG kʼóá)
  • K *kʼʊ́p ‘head’ ~ G *kʼʷá id. (see above)
  • K *ci/ɪkʼa- ‘to listen’ ~ G *gá-s-akʼʷ- ‘to hear’ (see above, maybe not independent from ‘head’)

Gumuz has also /cʼ/; this seems to arise by palatalization from *kʼ. One comparison with Koman that might confirm this is *Ki´sʼ- ~ *kʼʲeʃ- ‘to cut’ (see above), another is K *tʼɪ´kʼá ‘heavy’ ~ G *Hicʼa id. (> D áhicʼa, SG hicʼ) (but *tʼ ~ *H looks suspicious). Likewise /c ɟ/ < *k *g, in some cases preserved as velar already in Daatsʼiin, e.g. PG *kʼʷákʲá > D kʼokéé, SG kʼócá, NG kʼwácá ‘eye’ (probably originally a compound ‘head-eye’, cf. narrow Gumuz -(a)c(á) as the verb-incorporated form of ‘eye’); PG *tsʼeŋgʲa > D tsʼeiŋgi, SG NG tsʼênɟa ‘leaf’.

So far I have just one *t̪ʼ ~ *tʼ correspondence: Gwama tʼákʼál, tʼákʼɪ´ ~ Komo–Uduk *lɛ̀t̪ʼV, Dana–Opo *lɪ`t̪ʼá ‘tongue’ ~ G *tʼa(tʼa) id. (> D tʼatʼé, SG kʼó-tʼátʼá, NG kʼó-tʼa). These suggest Proto-Komuz *tʼá, extended with various morphology, most clearly in narrow Gumuz by compounding with *kʼʷá- ‘head’. The *lɛ/ɪ- forms in Koman find no subfamily-internal explanation, but they are reminiscent of Gumuz free-standing body part terms formed from a base SG ii´l-, NG li´- plus the corresponding verb incorporated form, discussed fairly thoroughly by Ahland in her grammar (in NG: li´kʼwá ‘head’, li´cá ‘eye’, li´sa ‘mouth’, li´ta ‘nose’; alas, it seems, no **li´tʼá ‘tongue’).

For Koman *sʼ (reflected as tʃʼ in Opo, so maybe rather an affricate *tsʼ) I only have a few unsystematic Gumuz correspondences as tallied in the sibilant residue list; and for *pʼ in either family I have no comparisons.

Implosives, too, are present in both Koman and Gumuz (only *ɓ *ɗ, as usual). No particularly clear picture about these yet either. I’ve already suggested above *haɗ- ~ *ára ‘3SG pronoun’, *ɓi₂d̪a ~ *ɓia ‘neck’. Some possible further comparisons are:

  • K *bàj- ~ *ɓàj- ‘to be wide’ ~ G *fag- id. (> D á-fág-ááʔii´l-; SG fag-ii´l- ‘widen’)
  • K *t̪ʰáɓ- ‘to kick’ ~ SG tib- id.
  • K *D̪à- ‘to go’ (as if *jà- in Komo–Uduk and Opo, but Dana d̪ā-) ~ G *ɗa- id. (> D ɗa-, SG ɗá-, NG tsá-)
  • K *ɗúbá- ‘to be tasty’, or alternately *t̪ɛ̀mɛ̀- ‘to try’ ~ G *ɗamb- ‘to try, taste’ (> SG NG ɗamb-)

Koman shows even a fifth, aspirated stop series. Examples so far suggest *t̪ʰ *kʰ corresponding to Gumuz plain voiceless *t *k, but *pʰ to Gumuz *f. Besides *pʰɪ´- ~ *fa- ‘to drink’, cf. also K *pʰuj- ‘to blow’ ~ G *fúj-tʃʼ- id. (> D fútʃʼa-, SG NG fwi´tʃʼ- [5]). I have found no *p ~ *p correspondence so far though and I would consider the approach that Proto-Komuz did not have an aspirate series, that voiceless stops by default develop aspiration in Koman, and the PK voiceless nonaspirates and Gumuz /p/ are secondary developments. Some internal reconstruction of root structure constraints in Proto-Koman might be required here for progress. It already seems e.g. that aspirate stops do not co-occur with voiced or plain voiceless stops, though this is a bit hard to tell with most languages neutralizing many voicing contrasts stem-finally.

Varia

Instead of continuing to repeat “these correspondences look plausible, but more data is needed”, I will move to presenting two final nontrivial correspondences:

Koman *j ~ Gumuz *g

  • K *ji´lɔ́ŋ ‘shadow’ (Dana–Opo) ~ G *masan-gʲila id. (> D masaŋgil, SG masáánɟilá, NG maasánɟi´lá) — though I don’t know what *masan- would be.
  • Uduk jɔ̀r- ‘to grind’ ~ G *gar-ʃ- id. (> D garʃ-, SG ganʃ-, NG gaaχʃ-); *-ʃ- is the incorporated form of *ʃa ‘hip’, per Ahland grammaticalized in a meaning ‘down, to the base’ (i.e. ‘grind down’).
  • K *bàj- ~ *ɓàj- ‘to be wide’ ~ G *fag- id. (see above)

In the absence of good evidence of prevocalic *j ~ *j (though see discussion of ‘water’ above), we could in fact consider that Proto-Komuz *j > Gumuz *g; or follow a compromise reconstruction as something like *ɣ. Plain *g does not seem enticing, given K *gàm- ‘to find’ ~ G *gam- ‘to know’ (> D gama-, SG NG gam-) and perhaps Gwama zūgū ‘to stand’ ~ G *tʃogʷa ‘foot, leg’ (see above).

Koman *r ~ Gumuz *r
This correspondence may seem trivial, but complications occur within both branches. In Gumuz, *r is retained as such only in Daatsʼiin, and seems to gutturalize to yield narrow Gumuz *ʁ > SG /ŋ/, NG /χ/ and in some further dialects zero. Ahland in her grammar for some reason prefers to reconstruct here *k — I cannot understand the logic of this, as this would leave without any explanation the correspondence k ~ k, e.g. SG NG ook- ‘to heat’; SG magókwa, NG magáákwa ‘night’; SG NG ʒákwá ‘bone’. Cognates in Daatsʼiin that show /r/ clearly settle the issue I believe: see above *ára ‘3SG pronoun’, *raʄii ‘black’, *ti´rágʲá ‘root’, *garʃ- ‘to grind’. Following this change, narrow Gumuz has no more /r/ left it seems, at most [ɾ] as a medial allophone of /ɗ/. The comparisons I’ve presented for ‘root’ and ‘to grind’ then suggest that *r is also already the original Proto-Komuz starting point.
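The argument against reconstructing *k here amounts to saying that two distinct SG ~ NG correspondence sets need two distinct sources. A small Python sketch of that bookkeeping, using only forms cited in this post; the segment alignments are entered by hand, `by_pattern` is just an illustrative name, and ‘to grind’ is left out because SG ganʃ- shows n rather than ŋ.

```python
from collections import defaultdict

# (gloss, SG form, NG form, SG segment, NG segment), aligned by hand.
SETS = [
    ("3SG pronoun", "áŋa", "áχó", "ŋ", "χ"),
    ("black", "ŋii´", "χii´", "ŋ", "χ"),
    ("root", "ti´ŋáɟa", "táχáɟá", "ŋ", "χ"),
    ("to heat", "ook-", "ook-", "k", "k"),
    ("night", "magókwa", "magáákwa", "k", "k"),
    ("bone", "ʒákwá", "ʒákwá", "k", "k"),
]

by_pattern = defaultdict(list)
for gloss, sg, ng, sg_seg, ng_seg in SETS:
    # Sanity check: the aligned segment really occurs in the cited form.
    assert sg_seg in sg and ng_seg in ng
    by_pattern[(sg_seg, ng_seg)].append(gloss)

for pattern, glosses in by_pattern.items():
    print(pattern, glosses)
```

Two patterns come out, ŋ ~ χ and k ~ k. Deriving both from *k would require a conditioning environment that separates them, whereas the Daatsʼiin forms with /r/ supply a distinct source for the first set directly.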

In a presentation from 2013 (handout: The Status of Gumuz as a Language Isolate), Ahland has proposed, with seven examples, that SG /ŋ/ ~ NG /χ/ would correspond with Gwama /j/. I do not think this is incompatible with a reconstruction as *r: stem-final Proto-Koman *r is apparently unstable in Gwama, and Otero’s data has examples like *D̪ir ‘green’ > zi^, *kɔ/ʊr ‘chief’ > Highland ʊ̄-kʊ̄l, Lowland ʊ̄-kwɪ`, *ɗar- ‘to send someone’ > Highland tʼál-à, Lowland tʼáj-à. The initial development is probably a merger with *l (which also shows stem-final instability, e.g. Gwama pɪˇ- ~ Opo pʰál- ‘to be spicy’), followed by *l > j fairly consistently in LoGw., sometimes also in HiGw., and lastly, often vocalization entirely. The step *l > j is also definitely the case in three of Ahland’s examples:

  • for her ɔ́ɔ́yɔ̀ ‘clothes’, Otero has HiGw. ɔ̀lɔ̀, LoGw. ɔ̀jɔ̀;
    (allowing for final vowels to be later suffixes, I would suggest Proto-Komuz *or > K *ɔr; ~ G *or-a > narrow Gumuz *oʁa > *aʁʷa > SG aŋwa, NG aχwa)
  • for her pày- ‘to fly’, Otero has HiGw. pāl-, LoGw. pāj-;
    (the other Koman languages then point rather to *pʰàd̪-, though Proto-Komuz *-d- > Gumuz *-r- would be conceivable)
  • for her ùhày ~ ùyáà ‘3SG pronoun’, Otero has HiGw. ʊ̄-hāl, LoGw. ʊ̄-hāj (masculine, as noted above).
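The relative chronology proposed for Gwama stem-final liquids can be written out as two ordered steps. This is a simplified Python sketch under stated assumptions: `gwama_final` is a hypothetical helper, it handles only the stem-final consonant, and it ignores both the sporadic *l > j in Highland Gwama and the later vocalization to zero.

```python
def gwama_final(proto_final: str, dialect: str) -> str:
    """Reflex of a Proto-Koman stem-final consonant in Gwama.

    dialect is "highland" or "lowland".
    """
    # Step 1: *r merges with *l in both dialects.
    seg = "l" if proto_final == "r" else proto_final
    # Step 2: *l (old and new) > j, modelled here for Lowland Gwama only.
    if dialect == "lowland" and seg == "l":
        seg = "j"
    return seg

# cf. *ɗar- 'to send someone' > Highland tʼál-à, Lowland tʼáj-à
print(gwama_final("r", "highland"), gwama_final("r", "lowland"))
```

Because step 1 feeds step 2, a Lowland /j/ can continue either *r or *l, which is why the Gwama forms alone cannot decide between the two.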

A fourth case where /j/ might be demonstrably secondary is her kɛ́y- (Otero: kɛ̌-) ‘to sweep’, maybe comparable with other Koman *gɛ̀ɗɪ`ʃ ‘broom’, but this would need tone and morphology issues to be worked out. Still, chances exist for aligning this with the K *-ɗ- ~ G *-r- correspondence that I propose for the 3SG pronoun.


In closing

Altogether I believe I’ve demonstrated here, with a bare minimum of regularity, at least the following Proto-Komuz segments (including also some trivial correspondences I’ve not commented on in detail):

  • pulmonic stops *p *b *t *d *k; maybe *g;
  • glottalic stops *tʃʼ *kʼ; maybe *ɗ;
  • fricatives *s *ʃ *ɣ;
  • sonorants *m *l *r;
  • vowels *a *e *ɪ *i *o *u; maybe *ʊ.

with suspicions of additional segments including at least *ɓ *tʼ *tsʼ *n *ŋ *j *w.

Time will show how this outline changes (surely it will) once data on Gumuz becomes more readily available and once other comparative work progresses (e.g. on language contacts with Omotic, Nilotic, Oromo, etc.).

[Edit 2024-04-22: see now also a few follow-up thoughts about stem structure in Gumuz.]

[0] Abbreviations thruout are hopefully mostly foolproof, but bear in mind (P)K = (Proto-)Koman, not (Proto-)Komuz; and D = Daatsʼiin, not Dana. — The overall family will probably need a better name than “Komuz” eventually, which seems to be simply inviting confusion with “Koman”. One option will be of course the un-snappy “Koman–Gumuz”.
[1] There is no general PK rule against onsetless syllables, and at least word-initial close vowels are tolerated: PK *iʃ- ‘to lie’, *ɪ₂´- ‘locative’, *ɪ₂lɪ₂l- ‘to ululate’, *uD- ‘to follow’, *úpʰ- ‘to bathe’. h-epenthesis could also apply before *ɔ, for which we find no well-distributed word-initial examples, but we do find *hɔcʼ- ~ *wɔcʼ- ‘to bite’, *hɔpʼ- ~ *wɔpʼ- ‘to sip’ = rather PK *ɔcʼ-, *ɔpʼ- with differing initial epentheses among the reflexes? Both show h- in Dana and Opo, w- in Komo. Uduk has wɔ̌cʼ- but kʰɔ̄bɔ̄s-; the latter perhaps prefixed or with irregular fortition. I have no real hypothesis off the cuff for what might happen to (pre-)PK initial *ɛ-.
[2] Prevocalic *m also seems not highly common in Otero’s PK data. His table on pp. 608–609 shows 12 clear examples of *mV versus 9 clear examples of coda *m and 3 examples of *CVm ~ *CVmV vacillation where the final vowel might be morphological (certainly so at least in *ʊ-kam ~ *kam-ʊ ‘brother’); compare with a much more onset-centered ratio of 18 : zero : 2 for *b. I would also treat as likely post-Proto-Koman arealisms the following: *Gwama ‘ethnonym’; *mɪmɪ ‘mosquito / firefly’ (only in Gwama, Komo and one Uduk dialect + contrasts with PK *taʃ ‘mosquito’); and treat Komo ʃúmákʼ, Uduk si¯māʔ ‘bone’ (versus Gwama si´, Dana ʃʊ́j, Opo sʊ́j) as not representing a longer PKoman variant **ʃʊjmakʼ, but as compounds of original *ʃʊj with a loanword *makʼ < ? *mekʼ, presumably from North Omotic (e.g. Ometo mekʼeta; this root has diffused also west into Cushitic: Highland East Cushitic *mikʼe, Dullay *mikʼe, Yaaku močʼo; possibly some further areal relationship also with Hadza mitɬʼa, Dahalo mittɬʼo). Regardless, currently I would find it premature to propose Komuz *mV > Koman *bV without exception, as conditioning by e.g. a preceding PKomuz oral vowel could be entertained.
— A third example of denasalization could be K *bwàkʼ ~ SG maaʃi´-ukʼw-, NG maʃii´- ‘to hide’, but this shows also *mb- in Daatsʼiin (mbaʃi-) and the correspondence *kʼ ~ *ʃ has no parallels (though it plausibly represents some form of palatalization in Gumuz).
[3] Otero has six examples, of which the cultural items *ɟana ‘sorghum, millet’, *dʊ́ɟɛ̀ ‘pipe’, and the proper names *ɟaŋgaj ‘Nuer’, *dáɟV ‘Dazu’ could be readily suspected to be post-PK Wanderwörter. The only real core vocabulary item is *ɟà- ‘to dig’ (only attested in branches that devoice obstruents, but low tone in Gwama ʃà- requires a formerly voiced initial; after *voiceless initials, *low tone shifts to mid, one of the nicer soundlaws found by Otero).
[4] In my current blog theme, ɪ does not maintain its usual “forced serif” IPA shape but seems to only render as dotless i. To make it obvious which is which, I’ve moved all tone diacritics off of these vowels, thus e.g. ɪ´, instead of the basically indistinguishable ɪ́, í.
[5] The SG and NG forms appear as **fwi´tsʃʼ- in Ahland’s comparative wordlist, but I suspect this is just an error; examples in her grammar have instead fwi´tʃ- and fwi´tʃʼ-.


Reconstruction in dialectology: some problems of Mordvinic shibilants

Exposure to the comparative dialectology of Finnish is good for spoiling your expectations: “Here, have dozens of monographs working out the history of individual dialect groups in detail or the history of every dialect in outline, also here’s our six-digit-strong lexicographic archive covering every known majority-Finnish parish, and if you want to add to the work, feel free to start from our reconstruction of Proto-Finnic, unflinchingly stable [1] for closer to a century and a half now.” Against this kind of a standard, even any other Uralic languages with remarkably well developed dialectology (whether close by like Karelian or further off like Khanty) are going to seem middling, and cases that would seem like a very good start by global standards are actually going to look pitifully short on some measures.

This has been in the back of my mind in recent times, on and off, especially with respect to the Mordvinic languages. I’ve recently finished an article outlining several specific issues of Mordvinic dialectology that seem to have been missed so far — more on that once it’s out in print, which should be in a few months (also since it’s in Finnish and would probably deserve an English summary somewhere [2]). The most overarching disconnect is clear though. Existing dialectological work comes primarily from Soviet times, including the base data, and while native scholars have done good documentation work, this still neglects almost entirely anything that came before. First of all this means the large pre-WW1 dialectological collections, the lion’s share of it only recently fully published in H. Paasonens Mordwinisches Wörterbuch (MWB). Second of all, there has been little to no engagement with what general Uralic comparative research has to say on the history of Mordvinic.

In the 21st century, we do have these on hand now, and we could proceed on many fronts. For apparently about a decade now (oh how the time flies…), I’ve had the MWB data in xml form (and you can too!) and have been every now and then editing it down. The long-shot aim is creating a kind of a comparative or etymological reference edition, aimed more for the comparative Uralicist; e.g. cleaning off recent Russian loans into their own section, adding tabular formatting. Much more recently I’ve started to compile dialect group notes too. Clearly a good idea, as this has already inspired the forthcoming paper. Many other observations accumulating there might also make for some more small articles eventually; or maybe I’ll just publish the notes themselves as a dialectological outline of comparative phonology and lexicology; we’ll see. Hopes include also maybe managing to nudge a few proper Mordvinic specialists to work on in-detail studies of the historical development of some of the more interesting-looking dialect groups. I mean I could surely do decent work on this myself too, but as more of a generalist, I imagine not keeping up the motivation to spend years of work just on a single dialect group.


The third piece of the foundation is however still missing: a solid Proto-Mordvinic reconstruction that could serve as the starting point for historical dialectology. There are versions of PMo. out there, but seemingly none that has been based on sufficient dialect evidence in all details. Lots of general Uralic studies even simply keep citing correspondences with standard Erzya and standard Moksha. As a general Uralicist, my hope is of course that closer attention to dialect data could translate also into some new insights on general Uralic reconstruction (etymological, phonological, morphological, I’ll take whatever turns up). But the biggest repercussions are definitely for Mordvinic dialectology itself.

Interdialectal reconstruction is needed in historical dialectology above all for rooting: dialect data, compiled directly, will only give correspondence patterns, and these do not necessarily indicate which side will be innovative vs. archaic, although most simple isoglosses will have indeed this shape. [3] The same is of course true in all comparative linguistics, and working out which reflex (if any) is the closest to the proto-language requires reference to many lines of evidence…

In the reconstruction of shallower subgroup protolanguages, though, outgroup evidence might provide very clear arguments on this. E.g. the standard Mordvinic reflexes of the Proto-Uralic cluster *pt are Erzya vt ~ Moksha ft (or the corresponding palatalized clusters; but I will skip writing out palatalization doublets all thru this post). Erzya here is a clear outlier with its voiced first member of the cluster; all the rest of Uralic uniformly shows only voiceless default reflexes like pt, tt, t. Yet earlier reconstructions of PMo. posit instead *vt and devoicing in Moksha — which, sure enough, has extensive parallels for devoicing of clusters (e.g. *nt > tt, *lt > l̥t), so we could still predict ft without any additional assumptions. Still we can already ask, what benefit would this have over a PMo. *ft which is only lenited further in Erzya?

Secondly, subgroup reconstruction preferably should be based on all of its members equally! In reconstruction all the way down to the interdialectal, this step often goes awry with much more attention paid to “standard” dialects than peripheral ones. Such work might not be even registered as reconstruction. Currently there is remarkably little work that explicitly talks of “Proto-Erzya” or “Proto-Moksha”, although if we accept a division of Mordvinic into exactly two languages, this basically implies that such entities would have once existed. [4] The case of *pt is a good example again: while Moksha has ft everywhere, Erzya has vt only among some central dialects, while nearly all peripheral varieties (whether northwest or southeast, far western or far eastern) have ft systematically or (in the diaspora dialects of Tatarstan etc.) in mixture with vt forms. Some individual cases perhaps might be explainable as Moksha influence, but the big picture clearly really points to *ft still in Proto-Erzya, and vt being a recent, localized innovation of just the central dialects.

Thirdly, any histories we reconstruct for one feature will have implications for the reconstruction of others, too. If Proto-Erzya had *vt, we could hypothesize some or all of the *ft dialects going back to a slightly later “Proto-Marginal-Erzya” of some sort. However if PEr. had *ft, then they can be also entirely independent archaisms, and also any other feature with some kind of a “Marginal Erzya” distribution could be now also suspected to be more likely archaic rather than innovative.


Sometimes this third type of evidence also seems to be the best evidence that we can really rely on! Directionality is easy in historical phonology only when dealing with mergers, anything else allows at least suspicion of development in the other direction too. Typology of sound change may suggest directionality at other times, but is less binding in general, and could be outright contradicted by other evidence. And the worst examples might be any cases where *X and *Y turn out to merge as [X] ~ [Y] in some transparent distribution. Did they merge first and split later, or did one of these split-with-merger first, and then the other do the same later?

A good Mordvinic case in point are the voiceless postalveolars č, š. In basic word-initial positions we find the correspondence Erzya č ~ Moksha š, neatly across all dialects. [5] General typology would suggest that deaffrication *č > š sounds more likely than the reverse. A look at wider Uralic however shows that this turns up not only for Proto-Uralic *č (e.g. *čupa > Er. čova ~ Mk. šuva ‘thin’), but also for *š (e.g. *šiŋərə > Er. čejeŕ ~ Mk. šejər ‘mouse’). So we definitely need two sound changes in any case: *č- >> š- leading to Moksha, *š- >> č- leading to Erzya. Can we reconstruct either of them already in Proto-Mordvinic (and end up with either *č > *š > č for Erzya or *š > *č > š in Moksha for also what otherwise look like retentions)? No way to tell from just this evidence, it seems, since no contrast remains. (In principle it’s even possible that there is no one true answer: maybe variation existed already in Proto-Mordvinic, and/or maybe there are dialects that took different paths towards the same situation.)
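The indeterminacy described above can be made concrete with a toy model. The following is only a minimal sketch under simplifying assumptions: it looks at nothing but the word-initial consonant, the two hypothesis names and the helper `apply_changes` are made up for illustration, and the intermediate Proto-Mordvinic stages are exactly the two options discussed in the text, not established reconstructions.

```python
# Toy model of word-initial PU *č and *š in Mordvinic.
# Each scenario is an ordered chain of (source, target) sound changes.

def apply_changes(initial, changes):
    """Apply an ordered list of unconditional changes to one segment."""
    for src, dst in changes:
        if initial == src:
            initial = dst
    return initial

PU_INITIALS = ["č", "š"]

# Hypothesis A (assumed): merger as *č already in Proto-Mordvinic,
# followed by deaffrication *č > š in Moksha only.
HYP_A = {
    "Erzya": [("š", "č")],
    "Moksha": [("š", "č"), ("č", "š")],
}

# Hypothesis B (assumed): merger as *š already in Proto-Mordvinic,
# followed by affrication *š > č in Erzya only.
HYP_B = {
    "Erzya": [("č", "š"), ("š", "č")],
    "Moksha": [("č", "š")],
}

def outcomes(hypothesis):
    """Map each language to its reflexes of PU *č and *š."""
    return {
        lang: {pu: apply_changes(pu, chain) for pu in PU_INITIALS}
        for lang, chain in hypothesis.items()
    }

if __name__ == "__main__":
    print(outcomes(HYP_A))
    print(outcomes(HYP_B))
    print("indistinguishable:", outcomes(HYP_A) == outcomes(HYP_B))
```

Both chains end in the same attested state (Erzya č, Moksha š for both PU sources), which is just the point: with the contrast gone, the reflexes alone cannot choose between the two orderings.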

There are more details to be dredged up, however. One point already noted by Keresztes in 1988 [6] is that any word-internal postalveolar conditions a word-initial š- also in most dialects of Erzya, thus e.g. (PU *čača- >) šačoms ‘to be born’ besides standard čačoms, šakš ‘pot’ besides standard čakš, (PU ? *ša/ožə >) šuž ‘barley’ (also in standard Erzya) besides a few central dialects’ čuž. The wide distribution is probably in favor of an already Proto-Erzya assimilation or blocking condition, and additional *š > č in central dialects. Other evidence exists too of a change *š > č specifically in some other dialects. A recent find of mine has been a change to this effect in the diaspora dialect of Večkanovo (close to Buguruslan) word-medially when following a resonant, e.g. nolčtams ‘to lick once’ (elsewhere nolštams, a momentane derivative in PU *-šta-), even in some presumably recent Russian loans like starčina ‘village foreman’ (← старшина).

Alas, while these are examples clearly in the direction *š > č, this still somehow seems to tie into a wider pattern in Erzya dialects of sibilant / affricate vacillation after the coronal resonants n r l + followed by a vowel. E.g. the reflex of PU *künčə ‘nail’ is standard kenže, but many dialects show instead an affricate, kenǯe; these could be plausibly archaisms, especially since an affricate is attested also from Moksha (still written кенже however). And to make sure nothing stays too obvious, also in the other direction: western PU *wenəš ‘boat’ yields standard Erzya venč, versus only dialectally the plausibly archaic venš, in agreement again with also Moksha, which has veńəš. The dialect distribution of these phenomena (also e.g. ~ ) looks roughly the same regardless: any one dialect tends towards either affricates generally or sibilants generally. So this shows the same problem as we had between Erzya č ~ Moksha š: did the phonetic categories *R+affricate, *R+sibilant merge already in Proto-Erzya, or only accidentally later on? And if the former, which way? At this point it would be handy to be able to look over these dialect groups in better detail. Does either the affricate group or the sibilant group show features that can be more securely identified as shared innovations? If one (and only one) side did, that could indeed allow putting now more trust on which way Proto-Erzya swung here. Unfortunately we do not have this information readily available; the pre-existing work is rather too happy to split Erzya into quite a few basic dialect groups with no firm stances on their finer relationships. So goes it: many nice-looking but overly synchronic models of dialectology may quickly prove inadequate for answering any kind of diachronic questions.

More hints exist still, e.g. from the northeastern dialect of Alatyr I have found kiľč ‘skin’ (elsewhere keľkš) and kiľč ‘snare’ (elsewhere kiľkš)… I figure this represents early cluster reduction word-finally, circa *CkS > *CS, as other data from the dialect like keŋš ‘door’ (elsewhere keŋkš), luč ‘shell’ (elsewhere lučks), paŋs ‘patch’ (elsewhere mostly paŋks) seems to show too. [7] After this, we can posit either a re-application of *Rš > Rč, or maybe the arrival of general *Rš > Rč as an areally diffused innovation among some dialects. Again a point ever so slightly in favor of *š in Proto-Erzya?

As the matter stands, I’m still raking together the data and puzzling over it. But it’s very obvious that, methodologically, figuring out the true reconstruction of various shibilants in Proto-Mordvinic or Proto-Erzya (*č or *š, * or *, etc.) will require close attention to several types of dialect data, several only tangentially related dialect issues, etc. Before reaching a firmer conclusion, it would be dangerous to make too much of this data for arguing any particular historical connections between dialects. This is an interesting and open type of question in particular for the numerous eastern diaspora varieties of Mordvinic, many of which might be better-documented in MWB than in Soviet-era work… but it’s humbling to run into reminders that, no, a couple surface-similar dialect forms that differ from the standard language aren’t actually worth much until it’s been shown if they are innovative or archaic, and so we really do need more work on Proto-Mordvinic first of all.

[1] Actually it turns out that, from the viewpoint of Finnic as a whole, it’s more of a Proto-North Finnic, but for Finnish dialects this fact still hardly rocks the boat at all.
[2] Or better yet a summary in Russian, or indeed Erzya or Moksha; none of which I could provide myself though. But then also a lot of the comparative Uralistic research on Mordvinic has been done in either German or Hungarian so far, and even English would be a step up in accessibility to the native speakers.
[3] Synchronically this may not matter at all of course. Archaisms can easily rise to sociolinguistic identity markers, if prominent, for example the “addition” (from a historical viewpoint rather: retention) of –h– in various unstressed syllables in northern dialects of Finnish. A lot could be also said here about the various uses or demands of dialectology and which of these really need the historical angle, but I’ll leave that task for someone / sometime / somewhere else.
[4] Indeed I find it unclear if this has ever been explicitly substantiated, or merely assumed based on the binary ethnic division of the Mordvinic speakers into Erzyas and Mokshas. Some of my early looking into Mordvinic dialectology was motivated by checking if this is really true or not… it does seem to hold up, but the division must’ve originally been much more shallow than the current-day differences between standard Erzya and Moksha (see next footnote).
[5] In fact it’s close to being the only clean and regular phonological isogloss that also separates all of Erzya from all of Moksha. So far it seems almost any other one I look into comes up with some dialectal exceptions that suggest that it does not quite go all the way back to Proto-Erzya or Proto-Moksha — or extends further out, also into dialects of the other language, such that areality or parallel innovation becomes at least a possible suspicion.
[6] Keresztes, László 1988: A mordvin nyelvjárási č ~ š megfelelésről. – Domokos, Péter & Pusztay, János (eds.): Urálisztikai tanulmányok 2, 207–213. Budapest.
[7] Actually my initial thought was direct assimilation *kš > č after a coronal — but other data from the dialect contradicts this, e.g. palkšnoms, a frequentative derivative of ‘to burn’.

Posted in Methodology, Reconstruction

Against North Afrasian Palatalization

Despite decent acceptance as a real language family, the state of Afroasiatic reconstruction remains very precarious. Not many basic sound correspondences have been generally accepted, perhaps some trivial ones such as broad stability of the “standard” sonorants *m *n *l *r *j *w and the “standard” stops *b *d *g *t *k (probably also *p, though later widespread lenition to f gives some complications, e.g. the possibility of setting up also a “new” *p for Cushitic). Some of these correspondences come from a small handful of obvious-looking basic etymologies, including the core morphological evidence that seems to be what most compels people to accept the family (remember that all solid morphological comparison must involve some material etymological comparison too, not just structural arguments). There also seem to be some well-accepted partial reconstructions, such as the existence of at least four laryngeals *ħ *ʕ *h *ʔ, preserved at minimum in Semitic and (East and South) Cushitic, but with open questions about what happens to them elsewhere.

Much more specific PAA reconstructions regardless have been attempted too. Two are relatively well-known and stand as the starting point for most ongoing discussion, those by Vladimir Orel & Olga Stolbova on one hand, Christopher Ehret on the other, both from 1995. A third has also appeared some years ago though, and seems to have gone with little attention: Allan Bomhard’s, best accessible in a self-published book Afrasian Comparative Phonology and Vocabulary (2014, 393 pp.). The contents per se are not strictly self-published however: it represents essentially an editing-down of the Afroasiatic sections of Bomhard’s much longer Reconstructing Proto-Nostratic, published in 2008 in the Leiden Indo-European Etymological Dictionary Series. There’s just a few additional details in exposition, not really any in the reconstruction AFAICT. Still interesting at minimum as a selection of data.

Bomhard does not have too many new phonological correspondences of his own, but makes some attempt for a synthesis, broadly siding with O&S in consonantism, Ehret in vocalism. One feature however stands out as altogether innovative: support for a rule of velar palatalization in a North Afrasian subgroup comprising Semitic, Egyptian and Berber. [1] This is referenced back to J. Vergote, [2] who comments on this as a much more limited isogloss of Semitic ~ Egyptian ~ Beja g. If true, this would be a novel non-trivial isogloss for the family, and could mark Afroasiatic comparison moving away from mere superficial similarity, which often haunts early attempts at comparing any sets of languages (whether actually related or not). Bomhard’s proposed conditioning is also very interesting: before the close vowels *i and *u. Again, big if true; just plausible enough, but also unusual enough that it would just about rule out the possibility of independent innovation (regular old palatalization of velars before front vowels would be much more at risk of this). And there might be interesting follow-up questions, e.g. if this might be the primary source of some of the numerous different sibilants in Semitic, instead of them being simply projected as retentions from PAA…

But before any of that: does the base evidence hold up? On closer look, things really fare very poorly. First, how many positive examples are there at all? Out of Bomhard’s 72 etyma with initial velars (nos. 163–234 in the 2014 book), I find no more than six with palatalization happening, most of them also with other problems.

  • *kil- (#176): Egyptian ṯny ‘to raise’ ~ Gedeo kiil- ‘to weigh’. Would be rejectable as noise already due to poor semantics and abjectly poor distribution in Cushitic — an isolated word that does not reconstruct even for Highland East Cushitic, let alone wider East Cushitic. Bomhard moreover has a parenthetical “(? < *kilo-)” for Gedeo, but this looks to be a misreading. Hudson’s HEC dictionary has no asterisk here, and to me looks to be rather suggesting derivation from kilo, as in, a widespread clipping of kilogram!
  • *gib- (#199): Semitic √zbd ‘to give, gift’ ~ Eg. ḏb(3) ‘to supply, provide’. Note the absence of any proposed cognates with velars or high vowels! Bomhard has found a velar way across Nostratic instead, in PIE *gʰebʰ- ‘to give’. Obviously not safe grounds for working out Afrasian comparative phonology.
  • *gid- (#200): Sem. √gdd ‘to join, press together etc.’ ~ Eg. ḏdb ‘to gather, assemble’ ~ Hadiyya giddis- ‘to compel’, Kambaata giddis- ‘to order’. Tolerable, but note an unexplained failure of palatalization in Semitic.
  • *gin- (#201): Egyptian ḏn ‘to grind’ ~ East Chadic *gin- ‘to pound’. Tolerable, even if a binary comparison with inexact semantics.
  • *gir- (#202): Sem. √zrr, √ʔzr ‘to gird, surround’ ~ Eg. ḏry ‘to enclose’ ~ Berber *dər- ‘to hold, etc.’. Again, no velars in sight in Afrasian. Bomhard in his Nostratic book compares Dravidian *keṯ- ‘to enclose, etc.’ (with the plosive *ṯ!), PIE *gʰerdʰ- and *ǵʰer- ‘to enclose’.
  • *kʼub- (not in his lexicon but cited in the discussion of phonology): Sem. *ʔiṣbaʕ ~ Eg. ḏbʕ ~ Berber forms such as ḍaḍ ~ East Cushitic *kʼub-, all ‘finger’. The S ~ E comparison is nice and Cushitic certainly shows similarity too. I’d want to know how are we supposed to get *b > in Berber though?

Now, even if all this was taken at face value — we’d still have a failure to demonstrate any regularity for *k > *tʲ or *kʼ > *tʼʲ, or for conditioning by *u. And this becomes all the worse by Bomhard also advancing some reconstructions with *Ki- and plenty with *Ku- that fail to show palatalization:

  • *kir- (#177) ‘uppermost part’: e.g. Eg. krty ‘horns’, Tuareg takərkort ‘skull, cranium’
  • *kum- (#178): e.g. Sem. √kmr ‘to pile up’, Eg. km ‘to complete, add up to’
  • *kum- (#179): Eg. km ‘black’ ~ Gawwada kumma ‘black’ (however, likely rejectable for poor distribution; ‘black’ is unstable in Cushitic, so finding an accidental resemblance would be easy).
  • *ku(wa)n- (#180) ‘dog’: e.g. Guanche cuna (however, nothing in Sem., Eg., or the Berber languages proper, so perhaps irrelevant)
  • *gir- (#204): e.g. Sem. √grr ‘to flow, move swiftly’ ~ Ber. #ugur ‘to go, walk’ ~ Hadiyya geer- ‘to run’, Beja ʔagir- ~ ʔagar- ‘to return’ ~ West Chadic *guraʔ-, Central Ch. *gwar-, East Ch. *gVr- ‘to come’ (Evidence for *i doesn’t look good, is that based just on Beja?!)
  • *gub- (#206): Sem. *gab- ‘top, mountain’ (w/ various extensions) ~ East Cush. *gubb- id. ~ Central Ch. *guɓa id.
  • *gub- (#207): Akkadian gubbubu ‘to roast’ ~ East Cush. *gub- ‘to burn’
  • *gur- (#208): e.g. Semitic #gargar ‘throat; to gurgle’ ~ Ber. #gurz- ‘throat’
  • *kʼul- (#228): Arabic qalla ‘to raise, carry’, qulla ‘tip, apex’ ~ Ber. #ɣli- ‘to rise, ascend’ ~ Central Ch. *kul- ‘to lift’
  • *kʼum- (#229): South Sem. √ḳmħ ‘to be in despair’ ~ Eg. qm3, qmd ‘to mourn’ ~ South Cush. *kʼum- ‘to grumble’
  • *kʼutʼ- (#233): Sem. √ḳṭn ‘small’ ~ Sidamo kʼuutʼa ‘short’ ~ Central Ch. *kutʼun ‘short’

Many of these also have etymological issues or uncertainties of their own (and I’ve skipped a few cases where Bomhard reconstructs no vowel, but notes an *u being indicated by his Nostratic etymologies), but they should suffice to show Bomhard’s proposal is inconsistent already with his own data. These kinds of major ideas about historical phonology cannot be just extrapolated from two or three datapoints that look nice (thus even the case of ‘finger’ might be just a chance resemblance after all); they would need systematic support from several reliable etymologies! I would retain at most the idea of Egyptian *g > ḏ before *i — a change that seems to have been posited for a long time already by all sorts of mainstream Egyptologists too. There is, alas, no sound evidence here for setting up any kind of Semitic / Berber velar palatalizations, or any kind of a North Afrasian group.
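For anyone who wants to see the imbalance at a glance, the two lists above can be tallied by initial consonant and following vowel. This is just a quick sketch: the data is nothing more than the reconstructions as cited in this post (not Bomhard’s full corpus), and the helper names `onset_and_vowel` and `tally` are my own for illustration.

```python
from collections import Counter

# Reconstructions exactly as listed in this post.
PALATALIZED = ["*kil-", "*gib-", "*gid-", "*gin-", "*gir-", "*kʼub-"]
UNPALATALIZED = [
    "*kir-", "*kum-", "*kum-", "*ku(wa)n-", "*gir-", "*gub-",
    "*gub-", "*gur-", "*kʼul-", "*kʼum-", "*kʼutʼ-",
]

def onset_and_vowel(recon):
    """Split e.g. '*kʼub-' into its initial velar and first close vowel."""
    body = recon.lstrip("*")
    for i, ch in enumerate(body):
        if ch in "iu":
            return body[:i], ch
    raise ValueError(f"no close vowel in {recon!r}")

def tally(items):
    """Count etyma per (initial consonant, vowel) environment."""
    return Counter(onset_and_vowel(r) for r in items)

if __name__ == "__main__":
    yes, no = tally(PALATALIZED), tally(UNPALATALIZED)
    for env in sorted(set(yes) | set(no)):
        print(env, "palatalized:", yes[env], "not palatalized:", no[env])
```

On these lists, the only environment where palatalized cases outnumber the exceptions is *g before *i; before *u there is a single palatalized case (‘finger’) against nine without.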

I hope this little case study also demonstrates good reasons to think that even trying to include Afroasiatic in any kind of Nostratic or similar yet wider venture is currently very premature: what hasn’t been securely reconstructed cannot be securely compared either. Reaching down for data from shallow sub-subgroups is also not a way to build good comparisons. And even supposing we already had a decent-sized corpus of e.g. Egyptian–Semitic etymologies, the best route to progress still should be to find out which of them have solid Chadic and/or Cushitic cognates too, not skipping a step and wildly datamining comparisons with Dravidian or Indo-European or whatnot. Any clear theory of relating language families to one another should tie together all involved languages, not just a handful of them.

[1] Or rather, “North Erythraean”, a term following Ehret, whose primary division of Afrasian is into Omotic vs. an “Erythraean” rump group. Since the entire inclusion of Omotic in Afrasian remains disputed, a better choice would probably have been to keep the name “Afrasian” for the latter and instead invent some other name for the full entity (that could be also easily discarded if the inclusion of Omotic turned out to not work after all).
[2] In his chapter on Egyptian in the 1971 Mouton handbook Afroasiatic: A Survey (ed. Carleton T. Hodge); Vergote further refers there to his 1945 book on the historical phonology of Egyptian.

Posted in Commentary, Reconstruction

Junk phonemes in Proto-South-Cushitic, and some possible fixes

Followup on my previous overview of comparative Cushitic: a slightly more involved look at Ehret’s Proto-South-Cushitic from 1980, and some readily observable issues in it. To reiterate slightly, his view of South Cushitic includes four basic units:

  • West Rift, a generally-accepted cluster comprising three or four languages (Iraqw–Gorowa, Alagwa, Burunge);
  • East Rift = two recently extinct-or-moribund languages (Kwʼadza and Aasax / Asa);
  • Ma’a, still treated by Ehret as a Bantuized Cushitic language;
  • Dahalo.

There’s a couple easily observable typological features that all four share, maybe most prominently

  1. the presence of labialized consonants; in most just dorsal consonants like /kʷ gʷ qʷ ŋʷ/, in Dahalo also a couple labialized coronals like /dʷ ɬʷ/;
  2. the presence of lateral obstruents; fricative /ɬ/ in all, also an ejective affricate /tɬʼ/ in most with the exception of Aasax and Ma’a, in some descriptions of Dahalo further also voiced /dɮ/ and/or palatals /cʎ̥ʼ ʎ̥/.

These are not a priori guaranteed to be innovative though — those reconstructed by Ehret for PSC he indeed goes on to later treat as Proto-Cushitic (and even Proto-Afrasian) archaisms, and his argument for the genealogical unity of South Cushitic is more involved, for one part of which see later. It might also seem premature for him to have focused on the “full” South Cushitic instead of just the clear Rift subgroup, but that’s what we do have and can therefore review.


Newer research has already put quite a bit more work into the comparison of the still more or less thriving West Rift languages (Kießling & Mous 2003, The Lexical Reconstruction of Proto-West Rift), and it seems like combining this with Ehret’s work could make a productive project. As noted in my previous post, there’s been also a fair bit of debate in the literature on what to do with Dahalo. Most of the discussion that I’ve seen, though, is unfortunately sort of typology-oriented and relies on methods like looking where cognates might be found and which of them look the most surface-similar to Dahalo. But even if we suppose the language is e.g. ultimately instead East Cushitic, there is no rule saying that sometimes e.g. a proper Proto-Cushitic etymon could not have survived just in, or mainly in, Dahalo + Rift, instead of being something like a Rift loanword in Dahalo. More detailed work on its historical phonology might be able to sometimes make this distinction, and Ehret’s proposals on this should not be wholly ignored. That Dahalo’s consonant system is one of the most “kitchensinky” on the planet (it has a little bit of almost anything you could ask for: clicks, ejectives, implosives, prenasalized consonants, a dental/alveolar distinction…) should surely also help: clearly it has been absorbing loanword phonemes for a while now, and then perhaps not only from unrelated Bantu or extinct Paleoafrican languages, but also from other branches of Cushitic? Ehret’s work already leaves a few clear openings for this kind of a hypothesis, I think. More on this further along.

The situation of research on Kwʼadza, Aasax and Ma’a looks much less satisfying. As for East Rift, R. Kießling, one of the more active West Rift researchers, seems to have deemed their closest eastern relatives not worth looking into, with claims appearing in overview works such as “The position of Qwadza and Asax is dubious, since there is not enough data (and probably never will be) to prove that they belong to a different subbranch within Southern Cushitic”. [1] I have not found substantiation for this claim of “not enough data” (how would one prove such a negative anyway?), and that Ehret already has given in his work reasons to think that they form a distinct East Rift branch looks to be simply swept under the carpet. Any idea that a language could not be even classified if it is extinct and its available documentation is short of state-of-the-art modern linguistic methodology is, of course, absurd. Historical linguistics can demonstrate just fine that languages such as Oscan or Umbrian are not merely Indo-European, they’re indeed Italic and further form a separate subgroup of it in contrast to the well-attested Latin. This is no folly or privilege of just Indo-Europeanists either, the same has been done again just fine e.g. with plenty of extinct poorly attested Semitic languages (Ammonite, Edomite, Samalic, Ugaritic, all of Ṣayhadic / Old South Arabian…), further all sorts of extinct poorly attested Algonquian or Tupian or Uto-Aztecan languages, etc etc.

Even single wordlists can yield fair evidence for a detailed classification, if a language is not too far removed from well-attested relatives, and can be thereby linked with the framework of historical phonology, lexicon and, perhaps, morphology that they allow setting up. For a Uralic example, consider Yurats (and I could cite also plenty of cases discussing the detailed dialectological positions of Finnic, Samic, Mordvinic, Mari, Mansi etc. varieties known again only from single wordlists). This is more or less what Ehret does too, even if the work is kind of buried within his corpus of South Cushitic etymologies and his overarching South Cushitic phonological reconstruction. Clearly missing though is any synthesis of the different sources of Kwʼadza and Aasax; the former known from five or six primary collections, the latter from three, many of both unpublished. Work on this might be desirable also for helping with steering people like Kießling away from outright denying the studyability of East Rift. [2] And really not even any more humble general description of either variety seems to have been published at all! Ehret, too, is mostly content to simply assert overall phonological inventories, without commenting much on the primary sources. This might be fair in the case of Kwʼadza, since he has himself conducted fieldwork with its last speakers (and there seems to be an unpublished manuscript from this that’s cited in various later work; even on Wikipedia amusingly enough), but less so with Aasax. Ehret passingly notes e.g. having converted the last field records from 1974 from phonetic into phonological transcription, but one would like to know some details on this too.

It might be less clear if East Rift is really entirely a sister group of West Rift, or just a divergent member, or even, some kind of an areal within it. At least one of the more distinctive common West Rift innovations that Ehret proposes, the shift of the ejectives *kʼ, *kʼʷ to uvulars *q⁽ʼ⁾, *q⁽ʼ⁾ʷ, is alas trivial within Cushitic, appearing in several other groups or languages including Agaw, Somaloid, Konsoid (here as an implosive [ʛ]!); and maybe indicating that some stage of Common Cushitic did not quite have *kʼ, but rather something slightly different such as an also-pharyngealized *kˤʼ. [3] The same holds also for one of the more immediately obvious common East Rift innovations, the merger of pharyngeals *ħ *ʕ with glottals *h *ʔ respectively, which is again attested in a large number of other basic groups, e.g. Agaw, Oromoid, Highland East, and per Ehret indeed also in Ma’a. But also many minor conditional developments are posited for both West Rift and East Rift. These would require actual review rather than blanket dismissal.

Several yet further differences between WR and ER involve weakly attested sound correspondences that Ehret simply reconstructs as additional Proto-Rift segments. Some do involve specific attestable segments in ER (e.g. WR *d ~ ER prenasalized *nd < Ehret’s Proto-Rift dental *d̪, oddly enough); some others, just “crossed” correspondences between more basic segments. This is where we start getting into clearly dubious territory. These segments do not really flesh out any highly sensible phonological subsystem. Some of them could be fitted into “empty slots” seen in the West Rift and East Rift consonant systems, but only with further assumptions about their development: e.g. a correspondence WR *b ~ ER *p is reconstructed by Ehret as Proto-Rift *pʼ, [4] even though there are no other examples of either ejective voicing in WR or ejective devoicing in ER.

Still, before continuing this thread, a few words also on Ma’a. For this variety we already have clear criticism of Ehret’s position of treating it as a third basic branch of South Cushitic: Mous 1996, “Was there ever a Southern Cushitic Language (Pre-)Ma’a?” The most basic argument, which I take no issue with, is that attested Ma’a should not be itself even treated as a Cushitic language, but merely a largely-Cushitic lexical register of a Bantu language otherwise known as Mbugu. This then opens the option that Ma’a might not originate by language shift from some highly distinct Cushitic variety, but rather, from adoption of Cushitic vocabulary from at least two sources, in his view one of them probably a member of West Rift, the other closer resembling Oromo. So far, so good; Mous even admits that language shift from Cushitic still remains on the table (though he thinks this pre-Ma’a to have been more probably East Cushitic), which to me at least would still sound like a compelling reason for why modern Ma’a-as-a-register arose at all. However, other problems remain. As in research on Dahalo, Mous too seems to deem various lexemes to be either “West Rift” or “East Cushitic” mainly by their etymological distribution, without detailed attention to comparative phonology. For a simple example, West Rift /ħ ʕ/ correspond with Ma’a /h ʔ/. While this could arise as a sound substitution by Bantu speakers unfamiliar with pharyngeals (but would they not have been also unfamiliar with the glottal stop at least?), it could also indicate borrowing, instead, from East Rift, where as mentioned, loss of pharyngeals appears natively; at least if we admit that things about the East Rift languages are knowable. Even the geographically closest certainly-Cushitic language to Ma’a is indeed Aasax! (None are in direct contact with it.) 
For that matter, at least a few broad but specifically Ma’a–Aasax isoglosses seem to be proposed in Ehret’s work too: *x > h and *tsʼ > s in at least some cases, again simple enough to be plausibly just sound substitutions by foreign speakers, but plausibly also real common innovations, especially if we no longer require that these should be reflected everywhere in the Cushitic component of Ma’a.

Furthermore, all this is complicated also by the proposals in literature that Dahalo, too, has ended up with lexicon of both South (≈ Rift) and East Cushitic origin. Cushitic vocabulary in Ma’a may not force the existence of a single substratal pre-Ma’a as an independent South Cushitic branch, but it definitely forces the existence of at least a Cushitic language in contact with older Mbugu; if two different Cushitic lexical strata are accepted, then at least two such contact languages. We must then still ask: what was the internal history of this / these Cushitic varieties? Already the geographic separation of Ma’a and Rift has implications for this. Even if we supposed there was simply an originally Rift variety that wandered further towards the coast along the Pangani river, this does not rule out a possibility that the “East Cushitic” component surfacing today was not borrowed independently into Bantu Ma’a, but rather, already into this Rift variety — just as the proposed two-stratum theory of Dahalo would also require the existence of South / East Cushitic language contacts. (Trivial if Dahalo is really South Cushitic, it’s already right next to southern Somali and Oromo(id) varieties, but less so if it’s supposed to be “just another” East Cushitic branch.) The opposite scenario can be considered as well: a lost East Cushitic language in the area, which had at some earlier point in history absorbed also some Rift influence, before itself contributing a chunk of vocabulary into Ma’a.

Even moreover, the general problem that “East Cushitic” remains without a good definition by shared innovations (and might even contain all of the putatively South Cushitic languages in it) keeps on the table also the option that some of the Ma’a vocabulary is not narrower East Cushitic-isms, as much as archaisms, lost from the attested Rift languages. This holds even if Ma’a is indeed analyzed to contain vocabulary from two different Cushitic sources! After all, it is not very economical to assume two completely separate Cushitic spreads south into Tanzania, only for one of them to then disappear completely except for a few loanwords into Ma’a. Instead, it would make geographical sense for the “East Cushitic” component to be still really para-Rift, as per the family tree followed by Ehret. Something of this sort is also readily suggested by theories of East African prehistory which posit the Rift languages and maybe Dahalo (if it’s been in Kenya longer than its immediate “East Cushitic” neighbors) to not represent any kind of an outpost of Cushitic, as much as a remnant of a continuous Cushitic belt that would’ve once, before the newer expansions of South and East Nilotic and Northeast Bantu, stretched all across the areas of modern Kenya and northern Tanzania.

All this leaves a large number of “moving parts” available for any reanalysis of the historical phonology of Ma’a. As we will see, various reanalyses are probably required regardless; but it also seems to me Mous’ idea of mixture of basically modern West Rift and modern East Cushitic is too simplistic and, above all, geographically implausible in an environment where nothing West Rift or classically East Cushitic has been attested. Prehistory hides many lost languages, and there is nothing a priori implausible in proposing one where evidence so suggests.


As I’ve mentioned recently on Twxttxr, any review of Ehret’s phonological scheme of South Cushitic should probably begin from the reconstruction of Proto-West Rift. There is fairly good overall agreement between Ehret’s reconstruction and the later work of Kießling & Mous (henceforth K&M), to be expected since also the modern languages retain the makeup of the system almost intact. PWR comes out with a reasonably distinctive system containing at least:

  • all six basic stops *p *b *t *d *k *g, plus labiovelars *kʷ *gʷ and uvulars *q *qʷ;
  • two ejective affricates *tsʼ *tɬʼ, interestingly without voiceless or voiced equivalents;
  • an almost full system of voiceless fricatives, *f *s *ɬ *x *xʷ;
  • the full original Cushitic (perhaps already original AA?) laryngeal system, *ħ *ʕ *h *ʔ;
  • all six basic sonorants *m *n *r *l *w *y, plus palatal and velar nasals, *nʲ *ŋ;
  • a bog standard Cushitic vowel system *a *e *i *o *u.

K&M add to this the labiovelar nasal *ŋʷ (actually mostly corresponding to Ehret’s *ŋ), vowel length, and, appearing mainly in clear loanwords, postalveolar affricates *č *ǰ. Ehret adds ejective *čʼ, supposedly distinguished from *tsʼ only in older Alagwa records (but see below). As it turns out, comparison already with Ehret’s own Proto-Cushitic reconstruction shows that most of these segments can be easily equated with identical predecessors also there; only *nʲ and K&M’s *č *ǰ seem to be entirely novel. A few call for other notes, but give no reason to doubt their PWR or Proto-Rift existence:

  • PWR *q, *qʷ: as noted above, clearly from older velar ejectives *kʼ, *kʼʷ.
  • PWR *tsʼ corresponds most prominently with Ehret’s PC *tʼ, suggesting spontaneous affrication as the original explanation of the phonological asymmetry of /t tsʼ/ without **tʼ **ts in West Rift (or East Rift). The same occurs in the neighboring Sandawe, and a partly similar /t tsʼ ts/ without **tʼ in the neighboring Hadza (both of them “Khoisan” candidate isolates, presumably ancient in the region), suggesting that this has been an areal innovation, arising on-site in the Rift Valley, i.e. at least not at any extremely early time during the Cushitic expansion southwards.
  • PWR *tɬʼ: various distinctive correspondences identified by Ehret, enough of them that there was probably a distinct PC precedent, even if not necessarily a lateral obstruent (I’ve heard of some recent work suggesting secondary lateralization of earlier palatals).
  • PWR *x, *xʷ supposedly correspond with both *x, *xʷ and *ɣ, *ɣʷ in Agaw, but also stops in various other parts of Cushitic (most consistently in Beja). Probably these are real inherited segments too, but I would wonder about options like reconstructing “old” PC uvulars instead, later then either fricativized or merged with velars.
  • The labialization contrast usually goes with /o/, /u/ vocalism elsewhere in Cushitic and might be secondary, especially if East Cushitic does not hold up as a subgroup; Ehret posits its most prominent innovation to be *Cʷa > *Co, *Cʷ > *C elsewhere, but perhaps this is archaic rather than innovative. Labiovelars occur also in Beja and Agaw, but they might have been independently innovated, especially since neither is an especially old group by itself. Same might go for labiovelars in Dahalo and Ma’a (if we don’t think they simply get all their “South Cushitisms” thru loanwords from Rift proper). Clearly needs further research though.
  • PWR *p: compared with /p/ also in East Rift, Ma’a and Dahalo, probably correctly. From the rest of Cushitic, Ehret however mostly finds comparanda with /b/. Already on general typological grounds I suspect these are mainly dubious, and that original Proto-Cushitic *p was instead shifted to *f early and just about everywhere, leaving a **p gap in most of the Cushitic languages. A new *p then would have arisen at some point in the development towards Proto-Rift. If this is per hypothesis mostly newer areal vocabulary, appearance of /p/ also in Ma’a and Dahalo probably won’t suffice as a defining PSC feature though.

After this things start getting worse. Already the bare numbers suggest bloat in Ehret’s deeper phonological reconstructions: his system of 29 consonants in Proto-West Rift expands to 33 in Proto-East Rift, 36 in Proto-Rift, and finally balloons to 49 in Proto-South Cushitic. A priori this might not be a completely terrible amount, when compared with an impressive 60+ in Dahalo, where many of these find unique reflexes… though then just 30-ish in any other South Cushitic variety. More alarming is that even Ehret himself finds no Proto-Cushitic source for most of these additional segments. This could be all still OK, maybe open to various kinds of reanalyses, if these reconstructions were based on good robust data. Alas, they are not. Most are based on few etymologies, often with semantic stretches or other irregularities. Often also with major distributional gaps, such that an asserted overall correspondence pattern really comprises e.g. individual Rift ~ Ma’a and Rift ~ Dahalo correspondences lumped together (or perhaps even weaker correspondences like Kwʼadza ~ Dahalo, West Rift ~ Ma’a, etc.), with no or almost no evidence of the implied Ma’a ~ Dahalo correspondences even existing. This strategy of “farming” or “lumping” rarer proto-segments from disjoint correspondences probably needs a name; I keep seeing it in many long-range or otherwise dubious reconstruction proposals, e.g. it’s all over the place in versions of Nostratic. (Cf. also footnote 4.)
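To make the “bare numbers” concrete, here is a minimal Python sketch that counts the Proto-West Rift consonants as they are listed earlier in this post and sets the quoted inventory sizes side by side. The grouping labels and function names are just illustrative choices. That the listed segments happen to add up to the quoted 29 is only an observation about this post’s own list, since it is not stated whether Ehret’s count includes his extra *čʼ.

```python
# PWR consonants as enumerated in the bullet list earlier in this post.
# The group labels are only illustrative.
PWR_CONSONANTS = {
    "stops": ["p", "b", "t", "d", "k", "g", "kʷ", "gʷ", "q", "qʷ"],
    "ejective affricates": ["tsʼ", "tɬʼ"],
    "voiceless fricatives": ["f", "s", "ɬ", "x", "xʷ"],
    "laryngeals": ["ħ", "ʕ", "h", "ʔ"],
    "sonorants": ["m", "n", "r", "l", "w", "y", "nʲ", "ŋ"],
}

# Consonant inventory sizes quoted above for Ehret's reconstructions.
EHRET_SIZES = [
    ("Proto-West Rift", 29),
    ("Proto-East Rift", 33),
    ("Proto-Rift", 36),
    ("Proto-South Cushitic", 49),
]


def listed_total(inventory):
    """Count all segments across the groups of an inventory."""
    return sum(len(segments) for segments in inventory.values())


def growth_over_baseline(sizes):
    """Return (name, size, surplus over the first entry) for each stage."""
    baseline = sizes[0][1]
    return [(name, size, size - baseline) for name, size in sizes]


if __name__ == "__main__":
    print("PWR consonants listed in the post:", listed_total(PWR_CONSONANTS))
    for name, size, surplus in growth_over_baseline(EHRET_SIZES):
        print(f"{name}: {size} consonants ({surplus:+d} vs. PWR)")
```

The surplus column is the part that matters for the argument: every segment counted there is one that needs its own supporting etymologies.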

The most distinctively poor set of Ehret’s extra segments are prenasalized stops and affricates in word-initial position. (Word-medial cases do not look distinguishable from plain old nasal + stop clusters, well-attested thruout Cushitic. I’ll also skip over *nɬ, which does not yield anything prenasalized and which Ehret in his later Proto-Cushitic work readjusts to *dɮ.) These are a regular but small part of the phonology of Dahalo, which should be probably assumed to mainly originate as intrusive vocabulary, maybe some also from irregular nasalization of former plain stops. And almost all of the South Cushitic etymologies Ehret finds for them are weaksauce. To roll out the data as he cites it (abbreviations: I = Iraqw, B = Burunge, A = Alagwa, Q = Kwʼadza, S = Aasax, M = Ma’a, D = Dahalo; transcription should be mostly obvious but I maintain Ehret’s for dental stops in D):

  • *mpats- ‘to be strewn’: A pasit- ‘to scatter’, pisari ‘seed’, B pisagariya ‘seed’ ~ D mbàttsì ‘potsherd’. Poor semantics, and also *ts is a suspicious PSC segment.
  • *mparoxʷ- ‘egret’: Q palaʔeto ‘crested crane’ (-l- < *-r- is regular) ~ D mbórogo ‘young egret’. Poor distribution, uncompelling semantics (mixing different species’ names is always an easy way of farming junk etymologies), apparently irregular *xʷ > ʔ in Q, and even supposedly regular *xʷ > D g sounds phonetically suspicious.
  • *mpee- ‘little, mean, scanty, slight’: Q paʔali- ‘narrow’ ~ M -bí ‘to shorten’ ~ D mbííṯ- ‘to scorn’ (→ mbííṯe ‘bad’). Very short CV comparison, semantics at least in D too divergent to put any trust on.
  • *mpuux- ‘sprout, shoot’: M -buká, -buxá ‘greens’ ~ D mbùùku ‘vine, tendril, creeper’. Poor distribution and semantics.
  • *mpɨnde-: M -púnde ‘penis’ ~ D mbéne ‘vagina’. Poor distribution, uncompelling semantics (possible, but not strong evidence by itself to believe in the comparison).
  • *ntaakʷ- ‘small carnivore’: I taweramo, B takoraymo, A tokoraymo (K&M: PWR *takʷerimo) ‘wild dog’ ~ D nḏááge ‘aardvark’. Poor semantics, irregular voicing and delabialization of *kʷ in D.
  • *nteekʼʷ- ‘incisor’: I taqesamo ‘jaw’ ~ D nḏéégi ‘canine tooth’. Poor distribution, uncompelling semantics, irregular voicing of *kʼ⁽ʷ⁾ in D (and why is the labialization reconstructed at all?)
  • *nʈarag- ‘Orthoptera species’: Q tsʼelemayo ‘cricket’ ~ D nḏàràgì ‘mantis’. Poor distribution and semantics, ad hoc metathesis of a presumed suffixal *-m- in Q.
  • *nʈaŋa ‘beestings’: Q tsʼangayiko ‘fresh milk’ ~ M dáŋá ‘beestings’. Limited distribution; semantics fine; no direct evidence of prenasalization though.
  • *nʈif- ‘food stirring stick’: I tsʼifraŋ, B čʼufara, A tsʼufara, S šeferank ‘tongue’ (K&M: PWR *tsʼufiraaŋʷ; *tsʼ- > š- in Aasax is regular) ~ D nḏufuro [‘food stirring stick’?]. Uncompelling semantics, even if this is likely an innovative lexeme in Rift compared to the rest of Cushitic.
  • *nʈoh- ‘to clear the throat’, *nʈoh-aala ‘phlegm’: B čʼohod- ‘to cough’ ~ Q tsʼalahet- ‘to curse’ ~ D nḏwààlà ‘mucus’. Uncompelling semantics in Q (requires also metathesis); irregular contraction *-ohaa- > -waa- in D; plausibly simply a recent onomatopoetic verb in B.
  • *nʈuu- ‘hawk’: S šuʔununu ~ D nḏúúma. Poor distribution; very short comparison, requires ad hoc morphology. Semantics apparently fine.
  • *ntsaaw- ‘reeds’, *ntsoomari ‘straw’: I tsʼawo ‘reeds’ (K&M: also B A, PWR *tsʼaaboo ‘sisal, bushy end’) ~ Q tsʼemaliko ‘straw’ ~ M izumari ‘flute’. Vocalism problems in Q, uncompelling semantics in M, no direct evidence of prenasalization.
  • *ntsew- ‘small bird sp.’: M -zewe ‘carmine bee-eater’ ~ D ndzòmò ‘barbet sp.’ Poor distribution, uncompelling semantics.
  • *ntsi- ‘spleen’: I tsʼi-daʕa ‘heartburn’ ~ Q tsʼiyale ‘spleen’ ~ D ndzóne ‘spleen’. Very short comparison with ad hoc morphology, irregular vocalism in D, poor semantics in I.
  • *ntsom- ‘to shout’: Q tsʼamaʔato ‘happiness, joy’ ~ M -zo ‘to cry’. Poor distribution and semantics.
  • *ntsoom- ‘kind of bee’: Q tsʼamayituko ‘bee’ ~ D ndzóóme ‘honey of ḿpeele bee’. Poor distribution, Q morphology unexplained.
  • *ntʲaduʕ- ‘bog’: M -darú ‘swamp’ ~ D ndodóʕo ‘mud’. Poor distribution, uncompelling semantics.
  • *ntʲodi- ‘grasp, grip’: M -dóri ‘to take; marry (a wife)’, -doríwe ‘to be married’ ~ D ndódi ‘thumb’. Poor distribution and semantics.
  • *ntʲooʕ- ‘gravelly soil’: B čʼiʕaramo ‘pebble’ ~ Q čʼaʔamuko ‘small streambed’ ~ D ndóóʕo ‘sand’. No immediate major flaws, still not an obvious etymology either though.
  • *ŋkara-: Q kalaʔeto ‘stork’ ~ D ŋgára ‘crested crane’. Poor distribution, uncompelling semantics.
  • *ŋkexine- ‘eyebrow’: I gine ~ D ŋgikine. Looks decent except for loss of *-x- (or indeed maybe of *-k-) in I. K&M have instead two different PWR etyma for ‘eyebrow, eyelid’, neither certain to be old inheritance though.
  • *ŋko- ‘flea’: Q koyimaye ~ D ŋgúnewe ‘spirillum tick’. Poor distribution, uncompelling semantics, short comparison, ad hoc morphology. Maybe the worst etymology here, despite stiff competition!
  • *ŋkol- ‘steer’: I B A karama ‘bull, steer’, Q kolawatu ‘bull’ (K&M: PWR *karaama) ~ D ŋgólome ‘bull buffalo’. Would look decent except for a l/r mismatch; also with parallels in East Cushitic that again have just plain *k- and also *-r- rather than *-l-, e.g. Borana (Oromoid) korma ‘bull’.
  • *ŋkum- ‘fog’: M -gónónó ~ D ŋgúmine ‘raincloud’. Poor distribution, ad hoc *u > o and *m > n required in M; semantics fine.
  • *ŋkʷaa- ‘rainbow’: B ilakʷekʷiya ~ D ŋgòòwi. Extensive ad hoc morphology (including reduplication) required in B; semantics fine.
  • *ŋkʷaal- ‘to impoverish, leave poor’: I kʷalaʔo, B A kʷaʔalitoʔo, Q kalaʔay ‘widow’ (K&M: PWR *kʷaʔalaʔoo) ~ M -gwa ‘to steal’, -gwaló ‘thief’. Poor semantics, and Ehret’s assumption of original *l plus metathesis in B A might not hold. No direct evidence for prenasalization.

A few of these might still be at least related areal vocabulary, but it should be clear that the low number of comparanda, both overall and especially in the otherwise well-documented West Rift, reliance on semantically off-field comparisons, and a common need for additional morphological or phonological assumptions, do not add up to a corpus supporting the existence of an already Proto-South Cushitic consonant series. E.g. the example of ‘bull’ could instead suggest a borrowing that is ultimately of Cushitic origin but was passed through some other intermediates before getting to Dahalo. Ehret also has one other similar case just in Dahalo and so not formally reconstructible for PSC: D ŋgaasið- ~ Somali kas- ‘to explain’ (besides prenasalization, vowel length also not matching; also s ~ s cannot be native).
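The distributional side of this can be checked mechanically. The sketch below is only a rough tally under my reading of the list above: each set is reduced to the language abbreviations cited for it (B and A under *ntsaaw- come from the K&M note), and the three-way grouping into Rift, Ma’a and Dahalo plus all function names are illustrative assumptions, not anything from Ehret’s own presentation.

```python
from collections import Counter

RIFT_LANGS = {"I", "B", "A", "Q", "S"}  # Iraqw, Burunge, Alagwa, Kwʼadza, Aasax
WEST_RIFT_LANGS = {"I", "B", "A"}
RIFT, MAA, DAHALO = "Rift", "Ma’a", "Dahalo"

# Languages cited for each prenasalized set in the list above.
ATTESTATION = {
    "*mpats-": "A B D",
    "*mparoxʷ-": "Q D",
    "*mpee-": "Q M D",
    "*mpuux-": "M D",
    "*mpɨnde-": "M D",
    "*ntaakʷ-": "I B A D",
    "*nteekʼʷ-": "I D",
    "*nʈarag-": "Q D",
    "*nʈaŋa": "Q M",
    "*nʈif-": "I B A S D",
    "*nʈoh-": "B Q D",
    "*nʈuu-": "S D",
    "*ntsaaw-": "I B A Q M",
    "*ntsew-": "M D",
    "*ntsi-": "I Q D",
    "*ntsom-": "Q M",
    "*ntsoom-": "Q D",
    "*ntʲaduʕ-": "M D",
    "*ntʲodi-": "M D",
    "*ntʲooʕ-": "B Q D",
    "*ŋkara-": "Q D",
    "*ŋkexine-": "I D",
    "*ŋko-": "Q D",
    "*ŋkol-": "I B A Q D",
    "*ŋkum-": "M D",
    "*ŋkʷaa-": "B D",
    "*ŋkʷaal-": "I B A Q M",
}


def branch_profile(langs):
    """Reduce a string of language abbreviations to the branches it covers."""
    cited = set(langs.split())
    profile = set()
    if cited & RIFT_LANGS:
        profile.add(RIFT)
    if "M" in cited:
        profile.add(MAA)
    if "D" in cited:
        profile.add(DAHALO)
    return frozenset(profile)


def tally_profiles(attestation):
    """Count how many sets show each combination of branches."""
    return Counter(branch_profile(langs) for langs in attestation.values())


def count_with_west_rift(attestation):
    """Count sets with at least one Iraqw, Burunge or Alagwa member."""
    return sum(
        1 for langs in attestation.values()
        if set(langs.split()) & WEST_RIFT_LANGS
    )


if __name__ == "__main__":
    for profile, n in tally_profiles(ATTESTATION).most_common():
        print(" ~ ".join(sorted(profile)), n)
    print("with a West Rift member:", count_with_west_rift(ATTESTATION))
```

On this reading, 16 of the 27 sets are Rift ~ Dahalo only, 6 are Ma’a ~ Dahalo only, 4 are Rift ~ Ma’a only, and a single one (*mpee-) is cited from all three at once; 12 have any West Rift member. The tally says nothing about the quality of the individual comparisons, which is the larger problem.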

There should be also a general suspicion of anything with prenasalized stops more likely coming from Bantu. Looking over Mauro Tosco’s 1991 A Grammatical Sketch of Dahalo, however (which includes a glossary with some loanword notes), fairly few cases of that have been identified: from Swahili there’s mbona ‘why’, nḏuugo ‘kinsman’, ŋgúúfu ‘strong’; from Northern Swahili nḏani ‘inside’, nḏigad- ‘to bury’, nḏoo ‘come!’ and tentatively nḏupa ‘bottle’, ŋgúúko ‘cock’ (~ NSw thupa, khuku). [5] Nothing from other nearby Bantu languages like Pokomo, but maybe that simply has not been (was not?) studied yet. But note also Dahalo’s nasalized dental click, which might also count as “prenasalized” phonologically, but almost surely can’t originate as is from anything Bantu.


It would be possible to go over similar problems in base etymological data also for some of Ehret’s other more poorly attested segments. As my tally of prenasalized consonants already shows, he in particular adds a few additional place of articulation series for PSC, which aren’t really reflected as such anywhere: retroflexes and palatalized dentals, most of them scantily attested and probably rejectable entirely. He proposes PC origin for two of them though: the voiced retroflex *ɖ and the palatalized ejective *tʲʼ. These show different issues, maybe worth discussing in more detail.

If taken at face value, Ehret’s PSC *tʲʼ probably should be first of all reconstructed instead as a postalveolar affricate *čʼ, since that is both his alleged PC source and its Proto-Rift reflex. Also the asserted Ma’a reflex is č, but most data for that are again poor etymologies, e.g. M -čá ‘to be crafty’ is compared with Q salimuko ‘coward’, D tʼar- ‘to practice witchcraft’; M hečéri ‘yet, not yet’ is first analyzed to have a prefix he- continuing a fossilized demonstrative, then compared with Q sel- ‘to straighten’, to gain a PSC root supposedly having meant ‘to make ready, prepare, put in order’. A few comparisons between Rift and Dahalo look better, e.g. *tʲʼatʼ- ‘soil, earth’ > B čʼečʼeʔiya, A tsʼatsʼaʔi ‘dust’ (K&M: PWR *tsʼatsʼaiʔya) ~ Q saʔamuko ‘earth’ ~ D tʼattʼe ‘mud’. But as the Dahalo reflex is just the alveolar ejective tʼ, it also turns out that most of the data would fit simply as cases of Ehret’s plain *tʼ! He relies most often on Kwʼadza data in making the distinction, where supposedly *tʲʼ > s (as in all three examples above), but these cases generally leave again room for doubts about their validity, e.g. irregular *tʼ > ʔ in ‘earth’ above. A few also have Ehret’s PWR *čʼ on the grounds of čʼ in older Alagwa data, but these are firstly few, and secondly, they mostly occur in a palatal environment (e.g. čʼiraʔa ‘bird’; K&M’s PWR *tsʼiraʔa) where we might suspect this was actually just a dialectalism or a lost allophonic feature. Hard to tell though from the current presentation. Ehret still gives also words like PSC *tʼah- ‘to be pregnant’ >> A tsʼihay ‘pregnancy’, but does not state if this comes from older or newer Alagwa data. Either way, Ehret’s supposed distinction *tʼ | *tʲʼ looks like it has been really “farmed” together from disparate sources, most prominently

  • older Alagwa tsʼ | čʼ;
  • Kwʼadza tsʼ | s;
  • Ma’a s | č;

that actually do not correlate well with each other. I’ve already tentatively suggested that the supposed Ma’a cognates with č are just wrong, and that older Alagwa čʼ might be secondary. For Kwʼadza I’m not sure if either of these approaches works entirely. Not all data with s looks easily dismissible, e.g. PWR (K&M) *tsʼaaʔas- ‘to shine, shed light on’ (probably with the common Cushitic causative *-as- suffix) ~ Q saʔ- ‘to burn’; PWR *tsʼitsʼaʕiya ~ Q sasaʔamo ‘star’. [6] A third option though might be internal loaning: in Aasax, the regular word-initial reflex of Proto-Rift *tsʼ is a sibilant š, and this would probably be reflected as s (Q has itself no š) if some words had been borrowed into Kwʼadza from an Aasax-like variety. Ehret’s later proposed distinct Proto-Cushitic sources also do not look very strongly established, but that would be more of a tangent than I want to get into in detail; though again, the same general types of problems recur as in Ehret’s PSC reconstruction.

As the last stretch for this blog post, Ehret’s PSC *ɖ proves to be relevant in several ways. Word-initially, this is proposed to be distinct from plain *d in that Dahalo would have an implosive ɗ for the former, a dental plosive for the latter; Ma’a would have mostly ɗ- from both, but sometimes also some *ɖ- > z-. A few do look plausible (e.g. I A deʔem-, B Q deʔ-, M -zéʔu ‘to herd’). All of Rift, however, shows just *d- (retained in I B A Q, > ɗ in S). Furthermore, comparison with Proto-Cushitic supposedly shows *ɖ- < *d- versus *d- < *z-. Both correspondences indeed have some decent etymologies for them, e.g. PC *dar- ‘to increase, add to’ > D ɗar- (Ehret only lists other cognates from Agaw and Somali; I’d consider adding also Highland East *darš- ‘to swell’); PC *zab- ‘to grasp’ (in Beja, Somaloid) > D ḏáβa ‘hand’. (Both of these also well represented in Rift languages.)

In his Proto-Cushitic book, Ehret claims that this chain shift of *d and *z would be evidence for the unity of South Cushitic. The fortition *z > *d is distinctive at first sight, but this sound change is again widespread in Cushitic, e.g. Beja, Saho–Afar, Oromoid, Konsoid; it may have started already at an early date, perhaps diffusing to Pre-Proto-Rift already from some neighboring East Cushitic dialect area. Thus, if Rift does not even show the distinction of Ehret’s *ɖ- and *d-, to me this supposed isogloss seems worthless. A real chain shift can be only really set up for Dahalo, and also without any shifts in place of articulation. The language probably has *d- >> ɗ- simply as a part of an areal innovation of implosion of voiced stops, appearing already also in Aasax (only ɓ- ɗ-) and Ma’a (all of ɓ- ɗ- ɠ- ɠʷ-), as well as in the local Bantu languages, including Swahili. Initial P(S)C *b-, too, gives Dahalo ɓ-. [7] The most likely reason for why Proto-Cushitic *z was not affected would be that it still remained a fricative at this point, if maybe already [ð] (which still exists in Dahalo as the intervocalic allophone of /d̪/), and only fortiting to a stop ḏ- later: thus, in particular, independently of the similar change in Rift.
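The ordering argument here can be spelled out as a toy derivation. This is a minimal sketch, not a model of the actual phonology: it only rewrites the word-initial segment of two schematic roots (*dar-, *zab-), the rule tables and names are made up for illustration, and it assumes that a stop arising from *z would have been imploded just like inherited *d if it had existed in time.

```python
DENTAL_D = "ḏ"  # Ehret's transcription of the Dahalo dental stop

# Rules as plain segment-to-segment maps, applied to the initial segment only.
# Assumption: implosion would catch any voiced coronal stop, dental or not.
IMPLOSION = {"b": "ɓ", "d": "ɗ", DENTAL_D: "ɗ"}
FORTITION_DAHALO = {"z": DENTAL_D}
FORTITION_RIFT = {"z": "d"}

# Orderings to compare.
DAHALO_ORDER = [IMPLOSION, FORTITION_DAHALO]    # as argued in the post
REVERSED_ORDER = [FORTITION_DAHALO, IMPLOSION]  # counterfactual
RIFT_ORDER = [FORTITION_RIFT]                   # plain merger, no implosion

D_ROOT = ["d", "a", "r"]  # schematic for PC *dar-
Z_ROOT = ["z", "a", "b"]  # schematic for PC *zab-


def derive(root, rules):
    """Apply each rule in order to the initial segment of a root."""
    segments = list(root)
    for rule in rules:
        segments[0] = rule.get(segments[0], segments[0])
    return segments


if __name__ == "__main__":
    for label, order in [
        ("Dahalo (implosion first)", DAHALO_ORDER),
        ("counterfactual (fortition first)", REVERSED_ORDER),
        ("Rift", RIFT_ORDER),
    ]:
        print(label, "".join(derive(D_ROOT, order)), "".join(derive(Z_ROOT, order)))
```

With implosion ordered first, the two initials stay apart (ɗ- versus ḏ-); with fortition first they would fall together as ɗ-, and in the Rift ordering they merge as d-. So the Dahalo contrast only requires that *z was still a fricative when implosion spread, which is the point made above.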

Besides implications on classification, another corollary is that if the distinction between Ehret’s *ɖ and *d was in the last common ancestor of Rift and Dahalo rather in manner and not place of articulation, there is little reason to expect the existence of Ehret’s other, more poorly evidenced retroflexes *ʈ, *ʈʼ (or *nʈ, which I believe I’ve already demonstrated above to be spurious).

The Ma’a initial correspondences, then, may have been assigned the wrong way around. It seems to me that we should suspect z- reflexes to be archaisms continuing PC *z- and not *d-. Most data could be in fact swapped around with ease, since Ehret has very little evidence for the correspondence M z- ~ D ɗ-; the only decent-looking case is I daʔ- ‘to penetrate’, A daʕ- ‘to thrust into’, D ɗaʕ- ‘to insert’ ~ M zaʔá ‘inside’ (apparently a native Cushitic root further cognate with Beja da- ‘to enter’). Another example, I daqaw- ‘to go’, D ɗakʷ- ‘to be going’ ~ M -zuxu ‘sandal’, is semantically divergent enough to not be immediately reliable. And if we allow for the existence of a few Rift or para-Rift loanwords in Dahalo, these could be chalked up as examples of that. I can even note from Dahalo the preposition ḏa ‘in’, which could be taken as the real native reflex of PC *za(ʕ)- ‘inside’! There also might be other explicit evidence that Ma’a z- originates from PC *z-: the above-mentioned -zéʔu ‘to herd’ looks comparable with Highland East Cushitic *zoh- or *zoʔ- ‘to roam, wander’ (whence Hadiyya doʔ-, Kambaata zoh-, Sidaamo do-). Ehret’s proposed correspondence *d- > M ɗ- ~ D ḏ- can be also dealt with similarly. Most data is weak. At least Q daʔas- ‘to scoop into fingers (e.g. porridge)’ ~ M -ɗaʔá ‘to pick, pluck’ ~ D ḏaʕaað- ‘to catch hold of’ looks like a decent comparison, though we could suggest that this is simply a Rift-type word in Ma’a. The absence of data showing the opposite correspondences, M z- ~ D ḏ- and M D ɗ-, will have to remain a weakness; but not a major one, if we will not claim the two to be relatively close relatives within a South Cushitic group.

Word-medially, Ehret’s *-ɖ- behaves differently. It is now supposed to reflect both of PC *-d- and *-z-; and to yield in Ma’a -ɗ- or -r-, in Dahalo -ɗ- or -ṯṯ- (yes, dental!). In just one environment, at the end of noun stems, *-z- and now also *-s- are supposed to instead give *-d-, yielding Ma’a -r-, Dahalo -ḏ- or -r-. Most of Rift has *-r- for both, with the exception of Burunge, where *-ɖ- > -r-, but *-d- > -d-. Looking up the full data behind any of this would be more work than any survey of word-initial correspondences (though would not need to be done from scratch: Ehret’s PSC lexicon helpfully lists also non-initial occurrences of each consonant). Given from earlier examples that Ehret is maybe particularly prone to bad etymologizing with Ma’a, and that variable reflexes there could have a complex background involving Cushitic-internal loaning, I will simply ignore it for now to save my efforts. Exclusive Rift ~ Dahalo vocabulary will also not be the most interesting here. However, if Ehret’s argument for the unity of South Cushitic from alleged common development of *d- and *z- does not hold, will word-medial evidence work any better? Again there is no evidence for a common shift in place of articulation at least. Cases of Dahalo -ḏ- [-ð-] from earlier *-z- would not need to have ever gone thru a stop at all; cases of -ɗ- might be again independent development from plain *-d-. The supposed development of *-s- to *-z- is a bit less trivial: maybe not in a general typological light, but any medial voicing of fricatives seems rare across Cushitic. And there’s again indeed evidence decent enough to think that this is a real correspondence, e.g. Somali gus ‘penis’ ~ PWR (K&M) *gu(d)doo ‘testicles’ ~ D giḏḏa ‘semen’; Somali ħaas ‘wife; family’ ~ PWR (K&M) *hadee ‘wife’ (for which Ehret reports instead reflexes with ħ-).
This is complicated, however, by Ehret finding that in Dahalo also verb-stem-final *-s- is voiced to [-ð-] (and *-f- to [-β-]), which changes do not appear in Rift (cf. e.g. the example of Q daʔas- ~ D ḏaʕaað- above). If the conditions of fricative voicing in fact differ, this innovation thus seems to be at most areal in nature, not inherited from common PSC. Also, the cases in noun stems could be even interpreted differently: as really devoicing of PC *z in original word-final position in a few languages like Somali — i.e. not even an innovation at all! Thus, still no particularly clear evidence here to set up a South Cushitic (Rift–Dahalo) group independent of wider East Cushitic or general Cushitic.

Many other issues remain that I’ve not even touched so far (e.g. Ehret’s PSC central vowels *ɨ, *ə which have no individually distinct reflexes in any of the languages). But I hope to have demonstrated that a better understanding of the South Cushitic hypothesis, and the history of its constituent languages, requires 1. actually engaging with Ehret’s PSC and PC reconstructions, preferably to some extent also with the spottily documented Kwʼadza, Aasax and Ma’a; 2. being regardless prepared to throw out plenty of weak etymologies in the process; 3. not taking older literature’s assumptions about the history, reconstruction or existence of “East Cushitic” for granted either. A big wide playing field… but probably not insurmountable.


Postscript. I’ve recalled I have around also one further work from Ehret on overall Cushitic reconstruction: a 2008 article “The primary branches of Cushitic: Seriating the diagnostic sound change rules”. [8] He has in this come to accept a few similar conclusions on SC reconstruction as I do here — he retracts the reconstruction of the distinct prenasalized stops (though mostly hangs on to the etymologies, claiming that they instead arise from the reduction of some semantically unspecified prefix *(h)in-), as well as the weaker retroflex and palatalized stops (*ʈ, *ʈʼ, *tʲ, *dʲ); former *ɖ adjusted now to *ɗ and *tʲʼ indeed to *čʼ. He still insists on what I see here as a just-Dahalo chainshift of *d *z to constitute a defining feature of South Cushitic though, with no word on the Rift merger. Also, if the most obvious junk phoneme issues have already been admitted, it might have been a good idea to move to further issues, such as Ehret’s non-ejective affricates *ts *dz *dɮ. All three attested phonemes in Dahalo, but Rift and Ma’a correspondences look more problematic.

His delineation there of East Cushitic looks dubious as well. The proposed defining features include e.g. the rise of a substantial implosive series: *pʼ > *ɓ, *dɮ > *ɗ, *tɬʼ > *ʄ, *ɣ⁽ʷ⁾ > *ɠ⁽ʷ⁾… clearly nonsense, when several putative EC languages / basic subgroups actually have implosive reflexes for at most one or two of these. But this was not a post for reviewing EC reconstruction; that would be a different topic entirely, a bigger one that has had several more people working on it too.

[1] “Some salient features of Southern Cushitic (Common West Rift)”. Unclear to me from the academia.edu page where, or even if, this is published.
[2] Ultimately I suspect Kießling’s position to have been influenced by the occasionally seen overblown claim that “morphology is the only way to classify languages”, some kind of a broken-telephone exaggeration of the fact that morphology can often provide very strong evidence for classification (but is in no way the sole possible type of evidence).
[3] Or even already a uvular *qʼ, which could have later on reverted to a more neutral / less marked *kʼ in languages in closer contact with Omotic or Ethio-Semitic. But this all surely still requires a good areal survey of the whole of Cushitic and environs. I do suspect this is not entirely Proto-Cushitic anyway, but has ultimately spread from Semitic somehow, though this is complicated by the *ḳ > q shift being in modern Semitic limited to Central Semitic (Arabic etc.), absent from both Ethio-Semitic & Modern South Arabian as far as I know. Would it be a plausible or in any way investigable hypothesis that *kʼ ~ *q variation existed earlier in Ethio-Semitic too, and was just later levelled out in favor of the ejective? If contact with EthSem. is hypothesized to have brought about *q > kʼ in modern Bilin (in Agaw), then it is at least conceivable that the same could have happened also in some of the currently smaller Ethio-Semitic languages; and perhaps not before them having passed uvularization “on to” various more southern Cushitic languages.
— A second, still more speculative idea that I could consider is that perhaps spontaneous kʼ > qʼ is in fact a natural sound change, and is merely blocked in most of the world’s language stocks with ejectives due to the fact that they happen to already possess also distinctive uvulars? This is after all the case almost universally in the Caucasus, quite widely also in western North America, and common even for several more isolated lineages with ejectives, e.g. Aymara, Itelmen, Mayan, Tuu.
[4] Projected also further to PSC and PC, and finally, in his PAA reconstruction, proposed to correspond with Proto-Omotic *pʼ, itself projected from the North Omotic branch. Actually, no actual cases of a correspondence of North Omotic *pʼ ~ West Rift *b ~ East Rift *p seem to exist: Ehret’s book on PAA lists 18 cognate sets with initial *pʼ-, but only five are attested in both Omotic and Cushitic, of these none in South Cushitic. His *pʼ thus really breaks down into disjoint sets of cases rooted in a South Cushitic etymon vs. cases rooted in a North Omotic etymon (and several reconstructed from still more indirect considerations like an Egyptian p ~ Semitic *b correspondence — six cases, two of them with a South Cushitic *pʼ cognate and one with a North Omotic *pʼ cognate).
[5] I don’t know OTTOMH if these comparisons are supposed to suggest earlier *nt, *ŋk in Bantu or the development of prenasalization within Dahalo itself.
[6] Per Ehret a derivative from the same root as the previous; per K&M instead derived from *tsʼaʕ- ‘to appear’, looking more likely since they document -ʕ- and not -ʔ-. Both seem to agree on reduplication.
[7] Ehret fails to recognize even this change as areal, and instead operates with allophonic implosion of word-initial *b- *d- *ɖ- already in PSC + its later reversion in most Rift languages.
[8] From the collection In Hot Pursuit of Language in Prehistory: Essays in the four fields of anthropology, ed. John D. Bengtson.

Posted in Commentary, Methodology

Prospects in comparative Cushitic

Long time no post! Those who have been following me on Tumblr or Twxttxr will know I’ve been recently digging into the history of the Cushitic languages — actually something I’ve wanted to get into for a while now. Here is a small outline of findings so far for the main blog too.

Cushitic is in some ways very similar territory to what I’m used to in Uralic studies: a family of 30–50 languages depending on how closely you’re counting; split into numerous small subgroups; recorded only recently, without long written traditions; thought to be a fairly old stock regardless; with at least one long-standing theory about its deeper relationships, but a loose enough one to be not much help for its own study. Some further similarities continue also in the research history of Cushitic, perhaps above all the fact that reconstruction has mostly not involved full-family comparison, but rather, comparison within major intermediate areals, some of them themselves identified by relatively weak signals. This probably should be expected to give fairly similar problems to what went down earlier on in Uralic studies due to too much reconstruction focus on assumed units like “Finno-Volgaic” that might not even exist after all.

Traditionally, Cushitic has been split in four main chunks: Northern, Central, Eastern and Southern (former “Western Cushitic” by now split off as Omotic, possibly unrelated entirely, as well as possibly itself multiple families). Northern = Beja and Central = Agaw are clean units, the former a single language, the latter a small distinct family — though there might be room to suspect some similarities within it arising not from common descent but from strong common Ethio-Semitic influence. [1] Eastern and Southern, however, have no consensus in literature for their extent! The original division was simply by geography: Eastern existing as a contiguous zone in Ethiopia, Eritrea, Djibouti, Somalia and northern Kenya, vs. Southern more scattered across Tanzania and partly Kenya. This is of course a priori about as reliable as claiming that Armenian must belong in Iranic because it’s an immediate neighbor of Persia, while Ossetic must not because it’s all alone in the Caucasus; and no wonder that later on other, quite different hypotheses have been appearing too. There are at least the following:

  • a South / East divide exists, but Dahalo, the one Kenyan language classically counted in South Cushitic, is actually Eastern (thus M. Tosco; followed by Glottolog);
  • Southern is altogether a subgroup of Eastern (thus R. Hetzron);
  • neither full Southern nor full Eastern are valid (roughly thus L. Bender, also Wikipedia in their primary navigation).

An additional problematic concept is “Lowland East Cushitic”, in all definitions including at least four clear small units (Oromoid, Konsoid, Arboroid, Somaloid, maybe further subgroupable e.g. as O+K and A+S) and variously proposed to include up to four more (Saho-Afar, Dullay, Yaaku, Dahalo). Its original definition seems to have been as a rump group in contrast to the clearly distinct Highland East Cushitic subfamily, thus inheriting almost all doubts we might have about the validity of East Cushitic. Moreover, LEC notably includes Oromo, the largest and most geographically central Cushitic language, already known to have left loanword strata in many of its smaller neighbors. Any surface-similarity or lexical-similarity definitions of LEC and EC should probably be considered dubious until it can be ruled out that the similarities are just due to contact influence from Oromo (in a few cases maybe also: from the almost as large Somali?), or just due to already Proto-Cushitic archaisms, perhaps lost in smaller groups further out. This, again, has a clear parallel in Uralic studies too — the already 19th-century discovery that, once a large stratum of loanwords from Finnish is excluded, the Samic languages are actually not that closely related to Finnic and certainly not regular members of the group, [2] followed soon by the observation that ruling out F → S loanwords is necessary also for achieving reliable results in reconstruction.

Paths forward in Cushitic reconstruction would then surely have to either focus on whatever clearly constitutes valid units (which may help with identifying what in them is actually inherited Cushitic and what might be later arealisms); or on working on Cushitic as a whole. The latter would not be an easy task, since much of Cushitic remains at a middling state of documentation [3] and also much of the literature is poorly available to me. The former has however some easier openings. At least three monograph-size smaller-branch reconstructions exist already as entry points: Agaw, Highland East, and Southern. (Sizable descriptive works cover a few other branches too, e.g. Dullay; I do not know how much comparison or reconstruction they involve.)

The Agaw data (Appleyard 2012, A Comparative Dictionary of the Agaw Languages) was my first toe-dip into comparative Cushitic, already since five years or so ago. It’s however quite thorough work given the available data, and also such a distinct branch, that doing anything more with it would require either more basic documentation, or more general experience in Cushitic (and probably also: more familiarity with Ethiopian Semitic). A few months ago though I started delving into Highland East (Hudson 1989, Highland East Cushitic Dictionary), and here openings for improvement exist left and right — I’ve already started working on a paper outlining the properties and impact of the obvious Oromo loanword stratum in most of the languages, and I’m also tempted to compile some phonological, morphophonological and lexicological observations eventually. See e.g. my Tumblr sideblog link above for several additional details.

Most recently I’ve also taken a first look at the Southern data (Ehret 1980, The Historical Reconstruction of Southern Cushitic Phonology and Vocabulary). This raises fairly different questions. Christopher Ehret has acquired some notoriety later on for proposing extremely messy reconstructions of Proto-Afrasian and Proto-Nilo-Saharan. It turns out that similar issues come up already here: etymologies with poor semantics, ad hoc morphology, overimaginative phonology (for examples see my social-media-formerly-known-as-Twitter link above). Still, it is clear that the Southern Cushitic languages in fact are all related, be it just by themselves or within Cushitic as a whole, and a reliable reconstruction outline of the clearly valid Rift subgroup could be probably extracted from Ehret’s work without too much extra effort (a partial version I believe has already been done for West Rift). Only two languages stand outside of it — Dahalo, as already noted above, and also Ma’a, a curious Bantu–Cushitic “mixed language” (probably better: a large Cushitic lexical stratum maintained through recent language shift to Bantu). In principle, there is no reason that their comparison with Rift should not give at least an approximately valid reconstruction, even if it might end up really being for some larger unit.

It’s also again Ehret who has done the most work on overall Cushitic comparison! (1987, “Proto-Cushitic Reconstruction”, actually not a monograph though but one of those ~200-page-long megapapers.) Weak etymologies are now more numerous still, but his Proto-Cushitic takes into account also earlier work and does not err too far off into fantasy, I think. His own Proto-South Cushitic also remains a major phonological and etymological input for the reconstruction, probably leaving issues to be fixed in the reconstruction of several per se valid comparisons. A good illustration of how scholars left to work on major topics all by themselves, unchecked by colleagues, risk drifting into speculation. This work demonstrates well also an issue about reliance on intermediate reconstructions… As soon as a lower-level reconstruction has been set up and is thought to be mostly reliable, this should also trigger checking some “reconstruction upwards” on which details of any intermediate reconstructions might call for adjustment. Ehret identifies himself a few of these already; but it seems not easy for anyone else to continue the task just off the cuff, when he is otherwise content to present for his Proto-Cushitic etyma normally just their Beja, Proto-Agaw, Proto-Eastern and Proto-Southern reflexes, i.e. not their lower reflexes, in e.g. Dahalo or even just in something like Proto-Rift. Still, this could be surely done with some editing effort and enough other sources on hand, and maybe leading to reasonable grounds for working out issues such as where does Dahalo group in Cushitic after all. (A few papers even have a curious suggestion that Dahalo contains both an “Eastern” and a “Southern” lexical stratum. Unclear to me though if this is based on any kind of clear phonological evidence, or just on lexical distribution, and then also, how would we decide which one of these is inherited and which one due to contact? Remains to be seen.)

Initial reading of comparative Cushitic is also helping me to put more trust in its own validity within Afrasian. The family is diffuse and challenging to reconstruct, yes, but nowhere near as diffuse and challenging as e.g. the Egyptian–Semitic relationship (per how I’ve seen it presented anyway, I am not an expert on either), and at least some 200-ish clear cognate sets exist — Ehret has up to 650, but fewer than half of these look trustworthy to me just off the cuff. Moreover, one earlier reason I’ve had for suspicion is that Ehret’s later reconstruction of Proto-Afrasian comes out almost identical to his Proto-Cushitic. But likely this is not actually a strong sign of Cushitic as a maybe-paraphyletic junk zone — instead, it seems likely to be an artifact of Ehret’s own methodology: starting from his own already overengineered Proto-Cushitic and then finding that some materials elsewhere in the vast Afrasian family can be easily matched to it. It would surely be interesting to see what becomes of that picture, too, if pared back to just the cleanest Cushitic and Afrasian etyma… In the meanwhile though, it might be more profitable to focus on improving the inner-Cushitic reconstruction than to jump straight to its more distant relatives.

[1] So far I’m particularly suspicious of the phonological restructuring of the vowel system, very similar in both Agaw and Ethio-Semitic.
[2] In later times now refined to include also e.g. Karelian loanwords in Eastern Sami, early Samic loanwords in common Finnish–Karelian, old Germanic loanwords into both, some false cognates entirely, etc., which has further allowed mostly abandoning even the idea of a Finnic and Samic as two sister branches in a common Finno-Samic group. Very little unique similarities remain that could not be identified as either shared archaisms from Proto-Uralic (or some slightly narrower intermediate grouping), as shared arealisms, or as trivial independent innovations.
[3] But not poor, I would think. There by now does not seem to be much of Cushitic going undocumented entirely, and many modern grammars and dictionaries have appeared over the last 30, 40 years, many even by native speakers. Older Italian-colonial-era records also seem to have been relatively thorough already, with better coverage than in many other parts of Africa.

Posted in Commentary, Methodology

Notes on Janhunen’s Law

(Part ca. 3 of n in my irregularly scheduled series of Introducing Named Soundlaws in Uralic Studies. [0])

The issue, as I see it

Most of the vowel correspondences we now think to be regular between Samoyedic and the rest of Uralic are those that were outlined by Janhunen in 1981. The actual sound laws behind them have regardless often gotten re-tooled or re-dated by now, much in the same way as many of them already had earlier precedents in some form (primarily from Lehtisalo or Steinitz). E.g. the chainshift *e > *i, *ä > e has been by now shown by Helimski to be post-Proto-Samoyedic, given Nganasan evidence for *e > †e > . On follow-up, also the reflexes of *ä > “*e” can be relatively open in some languages: Salminen (2012) has pointed this out about modern Forest Enets (e.g. *tät³tə > tät ‘4’), and to me it seems that e.g. the conditional developments *ä-a, *ä-å > *a in pre-Selkup also presume an open value for *ä. Cf. *ān-uj ‘true’ < PS *änå, or *kuəsə ‘iron’ < *wåsV < *wasV < PS *wäsa.

What I call “Janhunen’s Law” is, though, not any sound change in Samoyedic, but a proposal that he had in the same paper for an innovation in some uncertain number of western branches: PU *oCə > *uCə. Sammallahti (1988) indeed adopted it as an already Proto-Finno-Ugric innovation. Since then though there does not seem to have been too much support for it — but then neither critique nor any other analysis.

On any kind of closer look, it does seem clear this cannot be quite as simple as Janhunen suggests. First of all, also a correspondence western *o ~ PS *o exists. Janhunen identifies two examples: *koj(-wV) ~ *koəj ‘birch’, *kopa ~ *kopå ‘bark’. This number can be increased: clear examples also include *koj(ə)ra ~ *korå ‘male animal’; *kokə- ~ *ko- ‘to check, see’ (all of these with *ko-, but this looks simply accidental; *ko- > *kå- can be also attested in e.g. *kåmpå ‘wave’, *kåsə- ‘to dry’, *kåət ‘spruce’). Possibly also *ńoxə- ~ *ńo- ‘to pursue, hunt’, though Janhunen assumes that Finnic *nouta- continues earlier *ńux-ta-, thru a similar lowering as in *sou-ta ‘to row’ ~ PS *tu- < PU *suxə-, and this does not look entirely impossible.

I’ve observed already long ago (first presented at the 2nd International Winter School of FU Studies in Szeged in 2014) that there seems to be evidence for further conditioning. First, all of Janhunen’s positive examples involve front consonants in the medial consonantism: alveolars and labials. Four cases are immediately unambiguous:

  • *lumə ~ *jom ‘snow’;
  • *kusə- ~ *kot- ‘to cough’;
  • *purə- ~ *por- ‘to bite’;
  • *tulə- ~ *toj- ‘to come’.

I would add first of all two cases that should be reconstructed with *-w- and not, as proposed by Janhunen, *-x-:

  • *śuwə ~ *śo(-j) ‘mouth, throat’; *-w- is clearly indicated by Southern Sami tjovve.
  • *tuwə ~ *to ‘lake’; *u reflected at least in Permic *ti̮. Original *-w- seems to be indicated by Northern Khanty *tŭw, Konda tŏw, and maybe the oddly front-vocalic təw in rest of Southern Khanty. [1]

Probably even a third is *luwə ~ *lë ‘bone’. *-w- is again indicated by Western Khanty forms — mostly rhyming with ‘lake’, e.g. Konda tŏw, other Southern təw, Nizyam tŭw, Kazym ɬŭw (but in Obdorsk lăw, versus tuw ‘lake’). Samoyedic *ë could indicate a shift *ëw > *ow in other languages already before *o-ə > *u-ə (a tentative Proto-Finno-Ugric innovation — though this seems a bit too trivial and devoid of parallels to be relied on for that).

One additional example that was not known to Janhunen shows a palatalized alveolar medial: *wuďə ‘new’ ~ *oj- > North Selkup oć-əŋ ‘again’, a neglected etymology from Helimski (1976). [2] Note further that positing *o > *u here explains the rare initial combination *wu-, not reconstructed anywhere else in Uralic vocabulary and probably phonotactically impossible in Proto-Uralic proper.

Looking beyond Samoyedic, it also seems to be the case that from the evidence of other languages, we cannot really reconstruct word roots of shapes like *CoPə, *CoTə, *CoRə. The best two contenders are *monə ‘many’, *wolə- ‘to be’, but the first is readily under doubt as being a loan from Indo-European (also Permic *-mi̮n, Mansi *-mān, Hungarian -vAn in names of decads do not particularly have to be related to ‘many’ in Finnic and Samic), and the latter looks more likely to have been *walə-. On the contrary, many reconstructions of the shape *CoKə have been already presented: at least *jokə ‘river’, *rokə- ‘to hack, cut’, *soŋə- ‘to enter’, *šokə- ‘to say’, *toxə- ‘to bring’; maybe also e.g. *poŋə ‘bosom’, *oŋə ‘hole’ (if not rather *poŋŋə, *aŋə). I take this also as grounds to suppose that there has indeed been a sound change *-oCə > *-uCə, for C ≠ velar.
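As a rough illustration, the conditioned change *-oCə > *-uCə (C ≠ velar) can be sketched as a small rule in Python. The function name `raise_o`, the simplified vowel and velar sets, and the pre-raising input forms (*lomə, *kosə, etc., following the hypothesis argued for in this post) are all assumptions made for the example, not an established formalization.

```python
import re
import unicodedata

# Simplified segment classes; these sets are illustrative assumptions,
# not a full Proto-Uralic phoneme inventory.
VOWELS = "aeiouäüåəë"
VELARS = frozenset({"k", "ŋ", "x"})

# A stem of the shape (C)oCə: optional onset, *o, one medial consonant, *ə.
STEM = re.compile(rf"(?P<onset>[^{VOWELS}]*)o(?P<medial>[^{VOWELS}])ə")


def raise_o(stem: str, blockers: frozenset = VELARS) -> str:
    """Apply *oCə > *uCə unless the medial consonant is in `blockers`."""
    stem = unicodedata.normalize("NFC", stem)
    match = STEM.fullmatch(stem)
    if match is None or match["medial"] in blockers:
        return stem  # wrong stem shape, or raising is blocked
    return f"{match['onset']}u{match['medial']}ə"


if __name__ == "__main__":
    # Hypothetical pre-raising forms, written without asterisks or hyphens.
    for stem in ["lomə", "kosə", "porə", "tolə", "jokə", "soŋə", "toxə"]:
        print(f"*{stem} > *{raise_o(stem)}")
```

Passing a larger blocker set, e.g. `raise_o("kojə", VELARS | {"j"})`, would model the additional suspicion that palatal *-j- also blocked raising.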

I suspect also palatal *-j- might have blocked raising: cf. *kojə ‘male’ (though this is mostly continued in derivatives like *koj-ma, *koj-ra). An interesting case on this front is ‘to swim’, usually reconstructed as *ujə- per Finnic (Finnish uida, Estonian ujuma etc.), but most cognates (clearly at least Samic *vōjë-, Mordvinic *uj-, Permic *uji̮-, SKhanty üj-) better point to *ojə-. As I’ve noted by now in a talk from 2018, even within Finnic, Livonian vȯigõ (? < *oi-kV-) seems to still retain *o. The reflex in Samoyedic, on the other hand, mysteriously enough, is still indeed *u- or *uj-.

An alternative view?

The only counterproposal in any clear detail that I’ve seen comes from Jaakko Häkkinen, first in his Master’s thesis and later, much more briefly, in his 2009 paper on locating Proto-Uralic. He suggests inverting Janhunen’s Law, to apply in Samoyedic and not outside of it: *CuCə > *Co(C). I have seen / heard something similar by other colleagues in a variety of discussions, but I do not recall any defense of this being published. At most, see some discussion in this blog’s comments starting here, with Ante Aikio listing some notes about *o ~ *u variation within Samoyedic and additional irregular-looking examples of *o. Among these I would doubt at least the reconstruction PS *počå- ‘soak, ooze’, though. This probably refers to the words appearing in UEW under *poča- ‘become wet’; but Nganasan and (with irregular b-) Kamassian seem to point rather to *påTå-, with evidence for *o limited to Nenets–Enets. Or, since (old) Nganasan fo- can continue not just *på- but also *pə-, and Enets has o < *ə regularly, another option, maybe better still, would be that this was *pəčå- in PS after all, as would be expected per the Udmurt, Khanty and Mansi cognates; and that the Nenets word is a loan from Enets, while the Kamassian word doesn’t belong here at all. (Donner’s original data actually has not just a voiced b but palatalized , which is also difficult to explain.) In some other examples I don’t see any particular reason to think that they point to secondary *u > *o rather than secondary *o > *u (thus so maybe in “*num” ‘heaven’) or to *o at all (thus so in Nganasan tui ‘fire’ for expected ˣtüi: this looks like unclear retention of *u, which has other parallels).

Anyway, the major problem that I see in the inverted approach is explaining where Proto-Samoyedic *Cu(C) then comes from. There is solid evidence at least for a rime *-uj:

  • *tuj ‘fire’ < PU *tulə (a minimal pair with *toj- ‘to come’!);
  • *uj ‘pole’ < PU *ul(k)ə;
  • *kuj ‘spoon’ < PU ? *kujə (cf. Finnish kuiri ~ kuiru ‘id.’; I am not committed either way on if proposed Komi and Ob-Ugric cognates meaning ‘trough ~ mortar’ belong);
  • *puj ‘eye of a needle, etc.’ < *pujə.

The last two probably show PU *-jə > ∅ and PS *j as some derivative suffix, [3] but this alone cannot explain *u rather than *o, since also the latter readily occurs in CV stems: *ko-, *ńo-, *to, *śo-j. A few PS roots also show *u: natively at least *tu- ‘to row’ < PU *suxə; of unknown origin, *ku- ‘cord’, *ju ‘warm’ [4]. Some other CVC examples can be found too, including *pur ‘smoke’ < PU *purkə; *ut ‘road’ < PU ? *uktə. But at least these two examples we might argue to be irrelevant due to continuing PU *u in an original closed syllable, just with exceptional loss of *-ə after some probably very early cluster simplifications.

As comes to the lack of PS roots of shapes such as **Cup, **Cun, **Cuŋ, this could indicate that something happened to such cases, but it doesn’t follow that the result must have been *o. Other options would readily include reduction to *ə, already suggested by Janhunen in e.g. *təŋ ‘summer’ < PU *suŋə.

Future hypotheses

So far I do side with the hypothesis that Janhunen’s Law is a real phenomenon. Its exact extent and conditions seem to require review, however. I have some reasons to suspect that PU *o was in *CoCə stems retained not just in Samoyedic, but partly also elsewhere. E.g. *purə- / *porə- ‘to bite’ yields in Permic *puri̮-; *tulə- / *tolə- ‘to come’ yields in Mari *tola-; both more in line with development from *o than *u. An interesting recent discovery, premiered a few weeks ago on Twitter, has also been to note Khanty *lāńć ‘snow’ (> e.g. Surgut ɬ´åńť, Nizyam tɔńś, Obdorsk laś). UEW derives this from a distinct *ľomćɜ, listing here also some derivatives of PS *jom and probably incorrect Kola Sami reflexes meaning ‘frost’. But if we did reconstruct *lomə and not *lumə already in PU, the Khanty words, too, can be simply considered derived reflexes, at the PU level seemingly *lom-ća: *o-a > *ā is regular, and there does not seem to be counterevidence to assuming *mć > *ńć. Closer review might identify more cases like these that support the reconstruction of PU *o in the involved words.

As more of a long shot, there are also two unclear cases where evidence for *o might be found in Indo-European. For one, ‘to bite’ seems comparable with PIE *bʰe/orH-, root meaning probably ‘to strike, pierce’. The PU verb also probably meant specifically ‘bite thru’ (in contrast to *soskə- ‘to chew’), coming fairly close to ‘pierce’. Its descendants can be also used not of just biting with teeth, but also working with tools (cf. e.g. Fi. sahanpuru ‘sawdust’, as if “saw-biting”) — similar later development is attested in derivatives on the IE side too (Latin forō, Germanic *burō- ‘to bore, drill’) [5] and LIV goes as far as to give a gloss ‘mit scharfem Werkzeug bearbeiten’. Distribution all the way into Samoyedic makes it difficult to assume loaning, though, while a hypothesis about an old Indo-Uralic cognate would not, at the current state of research, rule out an original *u that was lowered to ablauting *e/o in PIE. — For two, there is Finno-Mordvinic *unə ‘sleep’, which Koivulehto (1991) has already compared with Greek ὄναρ, ὄνειρο- and explained exactly thru Janhunen’s Law: early IE *oner → early Uralic *onə > *unə. Whether the Greek word goes back far enough in IE for this to be feasible looks very dubious to me though, especially when there is a much better-attested PIE word for ‘sleep’, *swépnos.

A yet further possibility I would wish to look into in more detail in the future is, does the raising of *o that we seem to see really have the “same” *o as its starting point as is usually reconstructed in PU? Namely, traditional PU *o is in Samoyedic by default lowered to *å — such that its “survival” in Janhunen’s Law cases really looks to be also innovative. As outlined in yet another presentation a few years ago, I have also developed a hypothesis that the unbalanced inventory of rounded vowels in Proto-Uralic: *ü *u *o but no **ö, probably comes by a chainshift from pre-PU *u *o *ɔ. (I have not discussed this on the blog in detail so far and, alas, cannot do so right now either.) Then, the common tendency of PU *o to be lowered to *a / *å probably indicates that this chainshift had actually not fully taken place by PU: that “*o” was really still open-mid *ɔ. Janhunen’s Law positions, however, look like they might have already had close-mid *o. This would allow us to do away with a raising that happened all across “Finno-Ugric” with seemingly no motivation, while still also not folding the vowel correspondence entirely into PU *u.
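The difference between a completed and an incomplete chainshift can be made concrete with a toy sketch. The mappings below only restate the hypothesis given above (pre-PU *u *o *ɔ > *ü *u *o); the names `FULL_SHIFT`, `PARTIAL_SHIFT` and `shift` are illustrative assumptions. Each vowel is looked up exactly once, so the steps apply simultaneously and do not feed each other.

```python
# Hypothesized pre-PU rounded vowels, per the chainshift proposal above.
PRE_PU = ["u", "o", "ɔ"]

# Completed chainshift: every step has applied.
FULL_SHIFT = {"u": "ü", "o": "u", "ɔ": "o"}

# Incomplete chainshift: the last step (ɔ > o) has not yet happened,
# which is the scenario where PU "*o" was really still open-mid *ɔ.
PARTIAL_SHIFT = {"u": "ü", "o": "u"}


def shift(inventory: list[str], changes: dict[str, str]) -> list[str]:
    """Apply all changes at once; a vowel with no entry stays as it is."""
    return [changes.get(vowel, vowel) for vowel in inventory]


if __name__ == "__main__":
    print("full:   ", shift(PRE_PU, FULL_SHIFT))
    print("partial:", shift(PRE_PU, PARTIAL_SHIFT))
```

Under the partial mapping the inventory still lacks a close-mid *o, which is the slot that the Janhunen’s Law cases would then be filling.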

There would be also another option on the relationship of this *o with my pre-PU *u *o *ɔ. Rather than early raised cases of (pre-)PU *ɔ, they might be also straggling non-raised cases of pre-PU *o… And then was this *o really just an allophone of *ɔ either? *u is a very common vowel in PU, and perhaps this is partly because even some further cases should be likewise reconstructed as *o. This might be possible if we identified other evidence for it than retention as *o in Samoyedic. For the sake of example, one case might be Mansi *u: PU *u yields in Proto-Mansi either *u, *ŏ, *ă with no very strong conditioning apparent. (Some similarly open issues remain in Khanty and Hungarian.) So just maybe … could it be that PMs *u is a sign of PU *o as distinct from both *u and *ɔ in general? such that not only will we then reconstruct PU *por- ‘to bite’ (> PMs *pur-), but also e.g. *końćə ‘urine’ (> PMs *kuńćə), with *o > *u now also in Samoyedic in this environment (> PS *kunsə)? This would even have a good parallel among the front vowels: PMs *i is generally from PU (close-)mid *e, not from close *i. — But in the interests of putting these notes finally out at least in a somewhat assembled form, I will leave this line of thought open for now.

[0] See previously at least: Lehtinen’s Law; Moosberg’s Law; and one that definitely requires a name but I’m still mulling over what to call it precisely is *Ä-backing in Finnic. Several future installments remain planned too.
[1] On the contrary, an irregular fronting already in Proto-Western Khanty would also account for most of these reflexes: *tŭɣ > *tü̆ɣ > *tĭɣʷ > *təw, preserved in SKh and giving NKh *tŭw (cf. e.g. ‘fall’: PKh *sü̆ɣəs ~ *sü̆ɣs > SKh səwəs ~ süs, NKh *sŭws or *sūs). But it seems preferable to me to restrict this irregularity to Southern Khanty and treat Konda tŏw and NKh *tŭw as regular reflexes. — Maybe there is some possibility that the SKh development here and in ‘bone’ can be explained as *ŭw > *ū > *ǖ > *ü̆w > əw, leveraging the known fronting *ū > *ǖ? It doesn’t look like *ŭw and *ū actually contrast at all, so the first step here might be entirely virtual.
[2] Хелимский, Е. А.: О соответствиях уральских a- и e-основ в тазовском диалекте селькупского языка. – Советскoе финно-угроведение 12: 113–132. No cognates known elsewhere in Samoyedic, but the simplification *wo- > *o- would have to be pre-PS anyway, since by PS a new *wo- does exist and per two examples yields in Selkup *ko- as expected: *woəj > *ko ‘island, hill’; *wotå > *kotə ‘blueberry’.
[3] Though, since PS shows *r > *l / C_ in various suffixes, could it be possible that after *j, the resulting cluster further coalesced to *ľ, and then evolved into just *j as usual? In this case Fi. kuiri and PS *kuj could both go back to PU *kujrə (now with no especial reason to suspect a suffix in there).
[4] For a formal match and semantics within speculation distance, cf. PU *luwə ‘south’ ≈ ‘direction where the weather is warm’?? Seems unlikely but not impossible.
[5] And cf. further PU *pura ‘drill’, also already proposed to be an IE loan. So far it seems morphologically unclear to me how to connect this with either the PU or PIE verbs, though.

Posted in Reconstruction

State of the Blog: Second Decade

Blogging here at Freelance Reconstruction has been slowing down in recent times, as we approach the 10-year anniversary of its WordPress iteration, coming up just at the start of the next year. [1] In 2013–2019 I was writing about 1–2 articles per month; in the 2020s so far, fewer than 10 per year. To be sure, some of life’s external issues and circumstances have also been getting in the way, starting already with the obvious: CoViD-19 and issues downstream of it. But this also coincides with me finally being now at the rank of a graduate student, and being not just welcomed but expected (as of this year, by actual funders even [2]) to now put out my ideas as proper peer-reviewed publications. There is a whole bunch of work to do on this. Or indeed re-do: it feels like every article draft I sketch out ends up with at least one footnote to the effect “for earlier discussion of this, see Pystynen 2014 [blogpost]”.

Another turning point approaches too: where this blog will, at last, have more published than unpublished posts, both being at ca. 160. This may give a hint to what extent I have also quite a lot of unpublished research, most again formulated back in the mid-2010s, still stewing in my blog drafts. This is a situation that definitely calls for skipping over a step in the publication pipeline and refactoring this corpus, too, into other forms, now that I am able to do so. And this also does mean far fewer blog posts coming out as intended.

Even a third venue to air my ideas is by now moreover the Finnic Etymological Wiki Database, which I have been setting up over these same few years, under the folds of our project on writing a new etymological dictionary of Finnish (which uh, I don’t think I’ve ever announced here in detail; partly since it’s being written in Finnish). The platform is intended not just as a data backend for the dictionary, but also for discussion among scholars, e.g. for proposing new etymological ideas that do not seem quite ready for publication just yet. I’m by now doing this with some frequency, instead of spending more work on turning them into etymology squibs here (sample: is Mordvinic čakš ~ šakš ‘pot’ not a cognate of Old Finnish haaksi ‘ship’, but maybe a derivative from čava ‘plate’, if from earlier *šaɣa?). — Any colleagues interested in this, and with serious familiarity with Finnic etymology at least, are also welcome to request an account from me or the rest of the moderation team for contributing to the discussion.

By no means do I wish to abandon blogging altogether. But I may aim to shift away from the more effort-demanding blogposts to the effect of a mini-research article, at least as long as blogposts continue to be neglected by the powers-that-be as a recognized type of research output. Perhaps I will focus more here on reviewing issues, or bringing up points already made about them in the literature, than on presenting major syntheses on what to do with them. It remains to be seen how this will work out. But you can probably at least expect to see the next few Uralic reconstruction posts appearing here to be rather in this paradigm. Of course posting of other matters, e.g. on the state, context, philosophy and methodology of historical linguistics, is likely also going to be continuing on to the next decade. And maybe I will yet get around to re-hauling the site’s appearance or organization, as already hinted in 2019.

Thanks to all readers and commenters, and see you in the rest of the 2020s!

[1] The decennary of my linguistics blogging altogether has already slipped by about a month ago…
[2] I would also like to take an opportunity here to issue my thanks to Ante Aikio and Martin Kümmel for letters of recommendation to go with my funding pitch.

Posted in News

Long-Distance Comparisons As Butterflies

One of the rationality-cluster blogs here on WordPress, Aceso Under Glass, a while ago posted about a concept I find immediately useful: “Butterfly Ideas”. Roughly speaking, hypotheses that need further development, are probably not ripe for serious criticism as they stand, but could benefit from preliminary discussion (read the full post for more).

On this blog and elsewhere, I have repeatedly entertained a variety of “long-distance” linguistic relationships: Nostratic, Uralo-Yukaghir, Uralo-Eskimo, the works, despite not being so far highly committed to any of them. One idiom I’ve previously used to defend this is “big fish are worth angling even if you don’t catch any”; that there are major potential gains for our understanding of history (both intra-linguistic and extra-linguistic) if any of these theories start to prove themselves in more detail. Or as the more succinct modern spin goes, “big if true”. A second motivation is provided by what I have called the “cell theory of language”: spoken natural languages only come from other natural languages, never out of nothing. [1] This gives a strong prior that all natural languages are, indeed, related, even if we currently lack the knowledge of the details. Factoring in also anthropology further gives strong reasons to believe also in the existence of a number of “bottleneck proto-languages”, such as Proto-Australian, Proto-Amerind or Proto-Exo-African. So big fish are very likely indeed out there, even if we are not sure if our lures are working. Though then these are weaker boundary conditions that do not establish what currently-known families exactly would be the daughters of such a proto-language. E.g. who knows if some American languages might be not Amerind ≈ Beringian, but something else, like para-Na-Dene, pre-Clovis-coastal, Solutrean…? Continuing the metaphor, this would mean we don’t even know how big the fish are exactly, and so also we might not know (yet?) what are the best ways to catch them.

But there’s also a sense in which I think long-distance relationships would be better seen as butterflies than big fish. We do not find relationships in an instant, as sudden flashy discoveries (by “bites” on a “lure”). All spoken languages are in principle comparable, with known typological differences but also universal family resemblance. [2] The universality of basic phonological categories in particular makes it possible to find some resemblances between any two languages that plausibly could be indicative of some etymological or indeed genealogical relationship. Whether they actually are, depends on additional work on fine-tuning details. Are they above the level of pure chance, and independent of known onomatopoetic and nursery word trends? Are they in conflict with other data of equal value? Do they show recurring sound correspondences, at least some of them nontrivial? These are questions for which we cannot expect to have every answer in place immediately. Any relationship must always begin from observing some similarities that are not probative in themselves, and then pursuing this as a hypothesis and seeing if it guides us to more similarities, ones that will not require further costly assumptions to justify.

If all we knew about Finnish and Hungarian were that their verbs for ‘to live’ are, respectively, elää and él, this would not be sufficient evidence to establish them as related languages. But they are, indeed, cognates. Insufficiency or statistical insignificance does not in any way refute cognacy per se. And it is true that checking for more examples of the correspondences e ~ é and l ~ l turns up more evidence such as pelätä ~ fél ‘to fear’. Now with a new correspondence p ~ f, but this does not mean we turn up our nose and declare the hypothesis unworkable: it’s possible to continue and maybe discover, say, pesä ~ fészek ‘nest’. It always takes several steps like this to assemble e.g. a phonological core that will be self-evidently non-accidental. Same for other “evidential cores”, such as partial common morphological paradigms. There is no immediate bite that instantly proves a relationship, but rather, a first weak signal, which will rise in importance once combined with a proper selection of other datapoints.
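The bookkeeping behind this kind of step-by-step accumulation can be sketched in a few lines of Python. This is only an illustrative toy: the segment-by-segment alignments below are hand-made assumptions covering just the initial segments of the three word pairs above, and the threshold of two attestations for “recurring” is likewise an arbitrary choice, not any established criterion.

```python
from collections import Counter

# Hand-aligned initial segments of the Finnish ~ Hungarian pairs discussed above.
# The alignments are illustrative assumptions, not the output of any algorithm.
ALIGNED_PAIRS = {
    ("elää", "él"): [("e", "é"), ("l", "l")],
    ("pelätä", "fél"): [("p", "f"), ("e", "é"), ("l", "l")],
    ("pesä", "fészek"): [("p", "f"), ("e", "é"), ("s", "sz")],
}


def tally_correspondences(aligned_pairs):
    """Count how many word pairs attest each sound correspondence."""
    counts = Counter()
    for segments in aligned_pairs.values():
        # set() so that one word pair never counts the same correspondence twice
        counts.update(set(segments))
    return counts


def recurring(counts, minimum=2):
    """Return the correspondences attested in at least `minimum` word pairs."""
    return {pair for pair, n in counts.items() if n >= minimum}


counts = tally_correspondences(ALIGNED_PAIRS)
for (fi, hu), n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
    print(f"Fi. {fi} ~ Hu. {hu}: attested in {n} word pair(s)")
```

With just these three pairs, e ~ é is attested three times and l ~ l and p ~ f twice each, while s ~ sz is so far attested only once and would still need further support. Each new comparison either reinforces an existing row of the tally or opens a new one that has yet to prove itself.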

Any “minimum convincing argument” will not be dozens of steps deep necessarily, but where patience is especially needed is that at any stage there will be plenty of false paths of expansion that will not lead to a workable theory. If at some early point, we had formed a hypothesis of a ~ a, and then run into vapaa ~ szabad ‘free’ (without realizing that both are loanwords from Slavic) — we could still find more evidence also for p ~ b (e.g. by misanalyzing the correspondence Fi. mp ~ Hu. b), but no additional good evidence would be turning up for v ~ sz. At some point we might end up concluding that, yes, this is going nowhere and should be discarded. But then only this comparison! Finnish and Hungarian are still ultimately related, even if their words for ‘free’ are not cognate. Discarding this one comparison does not (should not) mean discarding also any other adjacent comparisons. A burgeoning comparative edifice needs to be open for exploration and individual mistakes, if it is to ever reach any particular rank like “a probable relationship” or “a proven relationship”.

This plea of course has also a corresponding inverse. Anyone who wants a “butterfly” treatment of their ideas has to have enough intellectual humility to recognize that it is, indeed, a tentative first-pass version. All too often I see also people who have a new language relation hypothesis in hand double down on their speculation, and not be open to even constructive criticism. Perhaps in some part there is a misunderstanding where people do not recognize the proposal of better, non-cognate etymologies (borrowing, onomatopoeia, internal derivation) as progress. But certainly also lone-wolf-genius-ism, and its attached incapacity to admit mistakes, is a problem that exists.

On the other hand, I don’t think this side of the problem needs to be focused on too much. In historical linguistics, the exploration of linguistic relationships is already a known research programme, a goal that many people agree to pursue even if we tend to disagree on quite a lot of details. With this in mind, if a K. Kookenstein puts out a paper allegedly showing how English is related to Arabic, but then refuses to consider these comparisons in light of what Indo-European or Semitic linguistics has to say on this: we don’t actually need his approval on this! Language data is not locked, copyrighted, or in any other way tied down to one person, and if desired, it will be possible in any case to check such papers for insights relevant also to better situated IE–Semitic comparison. I know I at least keep a few “Hungarian is too a Turkic language” type works around for this purpose. The intended main thesis is not going to pan out; but any data cited to this end could regardless still prove to be valid. Usually anything of this sort mostly relies on word comparisons (appeals to typology are strangely rare), and these might remain valid as etymologies of any imaginable type… not just Turkic loans in Hungarian, but maybe also old Hu. loans in Tk.; Hu. cognates of Khanty or Samoyedic loans in Tk.; common loans from some third source like Iranian or Yeniseian or Mongolic; some could even end up being evidence for a general Turkic–Uralic relationship. None of this is a priori ruled out, and in this way it may well be possible, with patience, to find meaningful building blocks even within theories that don’t hold up in their entirety. Such is a nifty property of historical linguistics, something that definitely doesn’t generalize to every science.

The two animal metaphors from the start of this post, though, no longer work very well at this point. Some butterflies … may grow up to be big fish, even though most probably don’t? Moreover, I have been mostly illustrating this discussion with disputed-but-definitely-published ideas. More nascent ideas that are simply brought up in a discussion are a different beast for sure. Of course there’s a selection bias here too: the actual butterfly ideas I do have, you will probably not be seeing on this blog as such (and you might have to watch closely to catch any even on my side channels). [3] Arguably also scientific publishing is “a conversation”… especially any ideas that can be so far found only in some paper draft posted for comments online (in linguistics they’re not even concentrated yet on any arXiv analogue). For these, the original reading of a butterfly idea seems to still work fairly well. This may hopefully help (e.g.) various long-distance proposals to develop better in the end, before they end up with one of two common fates: shelved as not having passed the judgement of Reviewer #2, or self-published with excessive confidence. For this goal, yes, the ball very much is first in the court of people who do have an idea and want to develop it; but it is also in the hands of the rest of us, in being willing to offer first criticism that’s not a complete dismissal. Thirdly, worth noting, all this also depends on a social milieu where people even can find parties interested in discussing some out-there idea.

A further aspect of AUG’s original concept — avoiding unnecessary emotional stress upon people presenting a new idea — I haven’t really even touched here yet. This would be a whole other jar of larvae, but suffice to say I agree that academic discussion, for all its standards of civility, fairly often can have undertones all the way to hostility. This probably scares away many people without a thick skin who might otherwise have had a few interesting things to say; and those of us who do stay engaged, to whatever degree, it may leave with more stress than is necessary.

Some of it, I’m sure, does not even come from a particular need to be prickly, but from limited time… Sufficiently well-known figures in a field tend to get approached by a disproportionate number of amateurs with A Revolutionary Discovery, unless they specifically keep themselves hard-to-contact, or, perhaps, maintain an aura of not suffering fools gladly. Again a problem that might be softened with other people being open and approachable enough. But this also starts edging towards the general area of science communication and public relations, a bigger fish still to fry that I’m not going to pretend to already have big original ideas for right now (and the butterflies, they will have to wait for other channels).

[1] The famous case of Nicaraguan Sign Language does not seem to have spoken analogues. In principle there is little directly preventing such a case (and something of the sort, maybe in several gradual episodes, will have to be assumed as the ultimate origin of human language too), but the conditions are unlikely to ever come about. A community of children who are capable of speech but do not have access to any pre-existing spoken language? Sorry, language in general is too adaptive to have been ever abandoned after its first introduction. I will go as far as to suggest that all known human cultures depend strongly enough on language for the transmission of cultural knowledge that any sudden failure of language skills across an entire human group (say, a transmissible disease that induces deafness, fast enough that a signed language does not have time to develop) would not lead to an all-new language being developed a few generations later; it would lead to the group’s extinction.
[2] In the philosophical sense, not the genealogical one. E.g. despite some exceptions, most languages still have nasal or labial or velar consonants; all but the most impoverished and unbalanced phonological inventories or even just consonant inventories are going to have substantial overlap between them. And even if we did find languages that somehow have completely disjoint phoneme inventories (lazy example: one has only stop consonants and front vowels, the other only continuant consonants and back vowels?), they will not be unbridgeably far apart: the known typology of sound change allows hypotheses relating basically any two speech sounds. Grammatical categories, too, can be quite different but still only finitely far apart, where the details of known language histories likewise give us ways to relate non-identical categories to each other (or to derive them de novo language-internally, etc.).
[3] A freebie for the sake of example though: cf. some very loose thoughts about the subclassification of Oceanic as floated on Tumblr just a few days ago (also already with some, though not highly severe, critique from a regular correspondent over there).

Posted in Methodology

Language Family Tectonics

Basic research in historical linguistics is mostly done within individual families: we take a swath of attested (in most cases modern) languages, and work towards the past to figure out their development from a common origin, one group at a time. Any knowledge of languages outside the family only really factors in as correction terms: filtering out loanwords and other contact influence, as data that the family’s overall internal history will not need to account for.

What the big picture of this looks like once we consider also geography is that we end up with a series of dots — “homelands” (though not to be understood as points of creation, but simply the last uncoverable phase of earlier processes) — somewhere in the past; some of which have then expanded, to cover the whole world by today. Just a few millennia ago, much of the world would have been an uncharted area, full of regions from which no knowledge of their languages has survived to us. The ones that do survive would, even, have been largely isolated dots. Most language contacts must eventually end (or rather, begin) at some point in the past. Languages of different families, that are today next to each other, cannot all have had their parents too as neighbors. Perhaps some individual cases were: Proto-Germanic seems to have been about as much of a neighbor of Proto-Finnic as Swedish and Finnish are still today; even further back, something like Proto-Kartvelian as a neighbor of Proto-Northwest Caucasian could be possible too. But once we consider highly expansive families, it is self-evidently absurd to propose that Proto-Indo-European could have been simultaneously a neighbor to all of (pre-)Proto-Kartvelian in the Caucasus, (pre-)Proto-Uralic in the taiga zone, (pre-)Proto-Dravidian in South Asia, pre-Basque in Iberia…

This already implies that most borders of today’s language families are collision zones: where two lineages have come to meet that were not in contact at some point in the past. (Same also for some, though fewer, language borders within them.) I’d like to think that we can probably divide them further in subtypes. This will have to include their history, not just their current but also past dynamics. One reasonable analogy might be plate tectonics. Geologists are not content to simply locate the current boundaries of the world’s tectonic plates, but ever since the rise of continental drift to a mainstream theory, already introductory maps will also aim to identify boundaries as either constructive, destructive or conservative. Often longer-term history or future, too, could be extrapolated from arrows of movement (of, yes, actual movement right now — as per the classic example and the mid-ocean ridge closest to me, the Atlantic Ocean is growing some three micrometers wider every hour, already a perfectly visible amount of maybe 0.3 millimeters since I began to write this blog post).

Of course this is not to be aped too closely. The social “forces” that drive linguistic expansions can be rather fickle, nowhere near as stable and predictable as the physical forces of geology in e.g. continental drift. No responsible linguist is going to be putting a predicted specific time of death on any but, perhaps, an already moribund language (those where all transmission to new generations has already ceased, and the only question is whether the last few speakers have 5 or 50 years left to live); and predictions on what languages will be gaining new ground entirely I have not really seen anywhere at all. If anyone wants to register particular predictions, be my guest, but currently these are really only going to be educated guesses, not derived from a theory with known predictive power.

So maybe let’s not draw any future-pointing arrows on linguistic fault zones just yet. Drawing past-originating ones, though, seems like a much more doable task, first of all in cases where (some) history is already known. And this I think also gives us anyway some analogues of geologists’ “constructive, destructive, conservative”. A look at known history actually suggests that just two types might be enough to get started. Of course we can have conservative boundaries, where languages have stayed each on their own side for a while. This often coincides with also geographic boundaries of some sort (e.g. the northern boundary of Indic has been, broadly, at the Himalaya for millennia, and it’s no wonder that the Korean / Japonic boundary has stabilized between the Korean peninsula and the Japanese archipelago). Then we have collision zones, where two lineages come head to head —

But wait. Head to head? No, actually, the most typical case we see anywhere in the world’s known history is not quite this. Where we find e.g. a Germanic / Celtic boundary in the British Isles, a Finnic / Samic boundary in northern Finland, a Turkic / Iranic boundary north of Iran, a Bantu / Khoe boundary in Botswana: these do not represent cases of two spread events that finally arrived at some common ground simultaneously, running out of no speaker’s land to claim. Almost always such a border represents one newer (Germanic, Finnic, Turkic, Bantu) and one older family (Celtic, Samic, Iranic, Khoe), with the latter’s historical range extending far into the former’s current-day one. The geological analogy happens to continue working here too to some extent: when two plates collide, for all the mountains that result, these still are not zones where both plates indefinitely squish and crumple without crossing. Instead one plate will be pushed underneath another, into the crust (and mainly the topmost one will jut up as mountains). Now the distribution of language families does not really have a Z-axis, but the time axis does similar duty here. We already routinely speak of e.g. English expanding (having expanded) “over” Brittonic; and call the latter a “substrate”, the former a “superstrate”, again employing terms from geology that strictly speaking refer to vertical location. I’m sure also a part of the motivation is one of geology’s core findings that, by default, vertical order reflects historical order!

To fully derive an understanding of this situation, the naive zeroth-order model of language family expansion (they start in some compact area in the past and begin expanding) moreover needs to be amended by the fact that expansions are not infinitely powerful: they can run out of steam even without encountering another expansion in their path. Not only does Finnish supersede various lost Sami varieties, it is also not the case that Samic started somewhere in the north and expanded south until running into Finnic. Rather, Samic also itself originally expanded mainly northwards, probably much along the same geographic routes. There was no southward expansion front of Samic for Finnic to collide with; nor an eastward expansion of Celtic by the time of the Germanic expansions, etc. In this way linguistic expansions might have a better geological analogy still in lava flows in a volcanic field: they will layer on top of one another, not by virtue of which one expands faster or more strongly, but by simple virtue of which one has already stopped, at least in a particular area, and which one is still going.

In those cases where two expansions do happen to be going on simultaneously, this is maybe indeed more likely to end up with something resembling a conservative boundary. Even among these, though, many will prove not entirely stable if we look closely enough. They can turn out to be series of small advances on either side, just not spilling out to outright conquest of the other family (and likewise, mostly not inherently one-dimensional lines anyway, but a crossfade in the proportion of speakers of X versus Y). Again more like lava flows than continents.

Still, I will continue to keep the term “tectonics” here anyway. Etymologically looking, it is not a term that by itself implies the details of plate tectonics, but simply refers to the largest-scale analyzable units.


What can we do with this then? If we recognize that the world’s major language family boundaries are mostly collision zones (where one family is or has been in the process of expanding at the cost of another, not currently expanding one), this gives us first of all convenient rules of thumb about linguistic substrates. Anywhere near a language family boundary, the substrate of an expanding family X is probably primarily the non-expanding language family Y next to it. At least in the wide definition of “substrate”, that is “the language spoken there before the expansion of the current family”. Whether it has left any discernible substrate influence, structural or lexical or toponymic, would be another discussion entirely. Conversely, locations where we might be able to fruitfully hypothesize completely extinct substrates will be instead:

  1. more towards the geographic or expansion centers of recently expansive families (thus e.g. the Paleoeuropean substrates of Germanic);
  2. underlying not-most-recently expansive families that have few or no leading edges over anything anymore (thus e.g. the Paleolaplandic substrate in Samic).

Or further yet. The facts that language families expand from small origins, readily take over other languages in the process, and are also generally just some thousands of years old, lead us also to a more powerful rule of thumb: There Was Some Other Language There Before. Almost no language is the absolute first language to have been spoken in “its” territory. The main exceptions would be a few cases of recent seafarers, above all in Polynesia; several more scattered cases also in the Atlantic, of which I think only Icelandic and Cape Verde Creole have been established as their own languages. [1] At any other ends of the Earth, Inuit is a known newcomer in the American high arctic, Pama-Nyungan is a known newcomer in the Australian interior desert (even if the languages preceding them are not attested)… and in places with long written history, we may find quite extensive known successions, to the effect of Hattic replaced by Hittite replaced by Luwian replaced by Aramaic replaced by Greek replaced by Arabic replaced by Turkish. Maybe some Assyrian or Kurdish phase in there somewhere too, depending on what point we’re considering here exactly. More importantly, over the remaining at least 60,000 years of modern human presence in West Asia without written records, there was obviously much, much more of this still. Not all of this leaves major genetic or archeological fingerprints, either, and some specific cases might be very hard to identify if we didn’t have linguistics itself as a source of evidence.

For two, it will be generally beneficial to work out which of any two language families in contact at a particular border has been the more recently expansive one. [2] Or at least to make this known more widely. I’m not sure if there actually are many cases where this would be a mystery entirely. I could think of some hard-to-tell cases once we’re talking about subfamily borders (Mari / Udmurt? Celtic / pre-Latin Italic?), but even here probably some dedicated experts would have an opinion. Maps of individual language families, especially in historical contexts, often enough also have some spread lines or historical distributions marked. But large-scale summary maps still trend towards presentations like this, seemingly entirely static, even though the process of restricting language families to complementary areas necessarily elides some current-day detail in favor of historical idealization (denoting where a language family “is native” or “is traditionally spoken”). I’ve seen sociolinguists criticize this whole genre of language distribution maps repeatedly already, for not really capturing synchronic reality. The response though might not need to be to abandon them entirely, as much as admit that, yes, they are maps that display some historical information too, and adjust accordingly for more history-informed design. If the knowledge on this is mostly out there, why not?

For three, a concept of family tectonics readily draws attention to the point that there’s work to be done not just on charting language families’ “current” or “traditional” distribution, but also their past distribution. “Beneath” (before) any current language family there “is” (was) some different distribution of other languages. Some of them maybe belonging in its still extant neighboring families, some maybe its own lost relatives, some maybe unknown entirely.

The first possibility I find the most interesting for the sake of further work. The closest example to my work comes from central and eastern Siberia. An important but I think largely open question would be what was spoken in the area before the expansion of the relative newcomers? Russian is of course the newest layer all over the place, but Siberian Turkic (Yakut, Tuvan, etc.) and Northern Tungusic (Evenki, Even, etc.) are both parts of relatively recent families too. What have they ended up displacing? Early Russian explorers report, and rudimentarily attest to, first of all a formerly wider distribution of the Yukaghir family, today known only in two small islets; and a variety of Samoyedic and Yeniseic varieties in the southwest of this area. Still, the main Turkic and Tungusic expansions must have been early enough to predate all historical records in the region, so this cannot be the whole picture either. One hypothesis I keep coming back to is the possibility of a lost “tenth” Uralic branch — perhaps para-Samoyedic, perhaps an independent branch entirely. This might have some benefits to it in explaining a variety of known but not especially substantial similarities between Uralic and all the other families further east. Turkic of course has been in direct contact with (branches of) Uralic anyway, but various parallels continue sporadically into Yukaghir, Tungusic, Chukotkan, Nivkh, Eskaleut. All of them seem more likely to originate from the Uralic side, due to it being the Siberian family with the most known time-depth. Yeniseian is sometimes approximated as rather old as well, but otherwise both “Neosiberian” and “Paleosiberian” are all families without too much time-depth. [3]

Most notably, Uralic parallels in eastern Siberia include even basic words for ‘reindeer’, an all-important livelihood animal for many groups these days, especially Chukotkan *qora (whence the ethnonym Koryak), Tungusic ⁽*⁾oron (or probably *xoron, with further diffusion after *x > ∅ in NTg) (whence the ethnonym Oroqen). Kolyma Yukaghir qoroj ‘two-year-old male reindeer’ is usually adduced here too, as well as loanwords further into Siberian Yupik. This has been already identified in earlier research as a Wanderwort originating in Proto-Uralic *kojəra ‘male [domestic?] animal’ > Proto-Samoyedic *korå ‘id.; bull reindeer’, which might have already had an allophonic [q-] in Proto-Samoyedic or even earlier. But we seem to lack especially clear evidence on who is to be credited for the original diffusion of this word. Yakut, as far as I know, has no reflex of it, splitting the Eastern Siberian region off from Samoyedic, and thus probably suggesting a pre-Turkic movement eastward. If so, then maybe even already at the time of the original Uralic expansion (which I think must have been partly eastwards too in any case)? Who knows. Maybe someone will eventually though, if we get e.g. some additional toponym data for guidance and keep inter-family comparative research going.

Elsewhere in the world, I’m wondering also about e.g. how far Africa’s other language families might have reached before the Niger-Congo and particularly Bantu expansion. The case of possible contact between Khoe and Cushitic is already preliminarily discussed in a 2009 paper from Blench, though I’ve been unable to verify his interesting claim that Khoe #goe for ‘cow’ would be comparable with similar “widespread terms” in Cushitic. [4] The quite tattered Central Sudanic looks like another good candidate for a family that might have been more widespread earlier (but might have been also encroached upon by Chadic and the various branches of Eastern Sudanic). In the Americas, too, I could wonder especially what preceded the large continuous spreads of Athabaskan and Algonquian in most of Canada and the northern US? (And also which of them is the newer one?) Was there ever anything to the effect of “Inland Tsimshianic” or “Inland Tlingit”, “Plains Iroquoian” or “Forest Caddoan”? Or turning to Oceania: how far west and east did the various “Papuan” language families (many of them even today not confined to just New Guinea) extend before the Austronesian / Malayo-Polynesian expansion? For that matter has anyone even tried comparing any of these with the other continental SEA languages in any capacity, or just assumed that they must have been in splendid isolation amongst each other linguistically effectively forever?

These are questions that, again, some experts might already know answers to or at least have hypotheses for. But nowhere is this information available in centralized geographic form, even though it would be surely possible to represent so, giving a kind of a bird’s eye view of what are the major ethnohistorical results achieved or confirmed by historical linguistics, and what questions still remain open.

[1] The Faroe Islands seem to be better established than Iceland as having had a pre-Norse population (at least as of the Nature study just last December). A longer list of cases without a distinct local ethnicity includes e.g. the Azores, Bermuda, Falkland Islands, Svalbard, Tristan da Cunha (and also remote islands in the other oceans, e.g. Kerguelen). There are some more within-reach cases like the Andamans, Maldives or Nicobars, for which I’m not sure what’s known of their prehistory (though then already the existence of two Andamanese language families suggests that one of them is very likely older than the other).
[2] Not always the same family on top in all interactions: Turkic has been expansive over Iranic, while Russian has been expansive over Turkic … and yet Russian and Iranic are both Indo-European. It should be no surprise at all either when we find e.g. language shift from Swedish into Finnish in Finland, vs. from Finnish into Swedish in Sweden.
[3] Really if “Neosiberian” is taken to mean “the recent but pre-Russian arrivals”, and “Paleosiberian” as everything else in the area — then we ought to be counting Uralic as the largest representative of the latter, not as some European family that somehow just happens to be also present. By now we do know the westernmost expansions of Finnic, Samic and especially Hungarian to be relatively recent, while Uralic or pre-Uralic presence in western Siberia has no established terminus post quem (short of the hard geological limit of the last ice age). — I suppose the usual exclusion of Uralic from “Paleosiberian” has been instead more informed by its typological similarity with Turkic and Tungusic. But then this seems improper when the term is Paleosiberian, not “Non-vowel-harmonic-siberian” or anything else of that sort.
[4] Checking with a recent monograph from Bender instead shows some very incomparable-looking terms in most of Cushitic, such as Oromo /saʔa/, Konso /lawaa/, Agaw (Central Cushitic) *lɨw-, South Cushitic *ɬee; or does Blench have some supposition about a Northeast Caucasian-esque *ɬ > *g?! Further north, *gʷow- ‘cow’ in Indo-European does look amusingly similar to Khoe, but Afrasian is a bit too wide and old of a family (definitely older than the domestication of cattle, which “only” dates to ~10,000 years BP) for me to think that there could be a connection entirely without it. Even something like the mysterious Y-DNA haplogroup R-V88, common in central Africa around Lake Chad yet seemingly derived from Eurasia, doesn’t really allow any connection that would reach all the way to southern Africa.

Posted in Methodology