On a whim, I’ve started to investigate the lexicon of Proto-Tungusic, for which the Moscow school of Nostraticists maintains a handy database (as they do for pretty much all Eurasian language families).
I am currently about 10% in, having looked through (and transferred into a spreadsheet for further analysis) all roots beginning with *a, *ā, *b and maybe half of *č. Interestingly though, there are already a couple of clear signs that the analysis is not exactly reliable, even without me knowing anything about any Tungusic language in detail. In some respects things appear to be, quite simply, terribly wrong.
In particular, one obvious argument stands out against the Altaic hypothesis, at least in the strongest form as advanced by the authors: around 98% of the words in the database (so far, 235 out of 240) are traced in some form back to Proto-Altaic.
So, Tungusic, despite being a family bordering unrelated languages on several sides (Nivkh, Chukotko-Kamchatkan, Yukaghir, Sinitic), and distinct enough also from its supposed relatives that no generally accepted protolanguage has been so far reconstructed — is regardless supposed to contain less than 3% non-inherited material in its reconstructible vocabulary! All substrate loans, all proto-language era loans, all areally widespread loans, all coinages, all onomatopoeias, all words that have semantically diverged so far that their ancestry has become opaque: these categories are supposed to wholly fit among the five allegedly non-Altaic word roots I have down so far. I guess the people responsible for this project haven’t grasped the idea that there even is a typology of etymology that they are violating.
Now, sure, if “Proto-Altaic” were brought forward as a synchronic grab-bag of word roots and typological features that are just found in some shape across a wide area in central to northeastern Eurasia, none of this would necessarily be a problem. We’d just call it a work in progress, and hope to eventually sort out which words indicate Mongolic loans in Tungusic, which ones Para-Tungusic loans in Korean, which a mutual substrate in Turkic and Tungusic, etc. Yet as far as I can tell, no self-professed Altaicist takes this stance.
It somehow gets even worse from here. A preposterous amount of the time, words are reconstructed to Proto-Tungusic on the basis of only a single language, plus external parallels. The typical language in the family seems to retain about 40-60% of the original vocabulary (you may wish to compare this against the previous number). If the vocabulary had later only been subject to random, independent loss, we’d expect words surfacing in only one language (out of ten, as per the database’s analysis: Evenki, Even, Negidal, Manchu, Ulcha, Orok, Nanai, Oroch, Udighe, Solon) to occur about 10 × 0.5^10 ≈ 1% of the time: 0.5^10 ≈ 0.1% for any one particular language, times the ten languages that could be the sole survivor. Even at the low end of the retention range, 40%, the expectation only rises to about 4%. Guess how many actual cases the current sample includes? 34, i.e. about 14%. An additional 17 roots (~7%) are then limited to a single sub-branch of the family, e.g. Northern Tungusic.
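For anyone who wants to check the back-of-the-envelope figure, here is a minimal sketch of the expectation under purely random loss. It assumes that every one of the ten languages retains a given root independently and with the same probability p, which is a simplification of my own and not anything the database claims. The function names are made up for illustration. Note that "attested in exactly one language" has to count all ten candidates for the sole survivor, and that roots lost everywhere drop out of the sample.

```python
from math import comb

def p_exactly_k(n: int, k: int, p: float) -> float:
    """Binomial probability that exactly k of n languages retain a root."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def share_single_language(n: int, p: float) -> float:
    """Expected share of surviving roots attested in exactly one language.

    Roots lost in all n languages cannot be reconstructed at all,
    so we condition on survival in at least one language.
    """
    survives_somewhere = 1 - (1 - p) ** n
    return p_exactly_k(n, 1, p) / survives_somewhere

# Retention rates spanning the rough 40-60% range mentioned above.
for p in (0.4, 0.5, 0.6):
    print(f"retention {p:.0%}: {share_single_language(10, p):.2%} expected")

observed = 34 / 240  # single-language roots in the current sample
print(f"observed: {observed:.1%}")
```

Even the most generous setting in that range stays well below the observed share, so the mismatch does not hinge on the exact retention rate.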
This kind of a discrepancy might still be excusable, if this were a Turkic-type situation — a family where one of the main branches is currently only represented by a single language. In such a case, any word that had been lost in the “main” branch could well have been still retained in the “minor” branch. But that won’t work here: the isolated vocabulary is scattered over several languages, including especially both far ends of the family (Evenki in the north, Manchu in the south), and occasional cases from most other languages as well.
Whether there are issues in the actual raw lexical data though, I couldn’t tell, but it’s cited from a decent variety of sources… so at least there should be no reason to suspect a systematic heterodox methodological bias.
Of course, knowing that there is a problem is not equivalent to knowing how it should be fixed. The latter will take a bit more work than a single blog post, I am sure. One path would be the traditional etymological approach: to just wade in and start noting comparisons that are phonetically or semantically dubious, and see how much that takes care of. But there are other options as well that might turn out more effective. E.g. zooming in on material that phonologically stands out (possible loanword phonemes and similar features) would perhaps lead to something. One step more quantitative yet, I also have in mind a relatively simple statistical check-up: correlating the internal Tungusic distribution of the word roots with the external distribution of their Altaic parallels. E.g. if a substantial number of loans to/from Mongolic have been misinterpreted here as inherited, I’d expect a language such as Manchu (neighboring Mongolia) to contain more of these than a language such as Negidal (by the coast of the Sea of Okhotsk). We’ll see. I will have to do a separate sweep of the Altaic database later to log this info, and I still have quite a while to go here as well.
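To make the distribution check a bit more concrete, here is a rough sketch of how it could be tallied once the data is logged. Everything in it is a placeholder: the records are invented toy entries, not real roots from the database, and the record layout and function name are my own assumptions about how the spreadsheet might be exported.

```python
from collections import defaultdict

# Hypothetical toy records, NOT real data. Each root lists the Tungusic
# languages attesting it and the other branches with a claimed parallel.
roots = [
    {"id": "r1", "tungusic": {"Manchu", "Nanai"}, "external": {"Mongolic"}},
    {"id": "r2", "tungusic": {"Manchu"}, "external": {"Mongolic", "Turkic"}},
    {"id": "r3", "tungusic": {"Negidal", "Evenki"}, "external": {"Turkic"}},
    {"id": "r4", "tungusic": {"Negidal", "Manchu", "Evenki"}, "external": {"Korean"}},
]

def share_with_parallel(roots, branch):
    """Per Tungusic language: share of its roots with a claimed parallel in `branch`."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for root in roots:
        for lang in root["tungusic"]:
            total[lang] += 1
            if branch in root["external"]:
                hits[lang] += 1
    return {lang: hits[lang] / total[lang] for lang in total}

shares = share_with_parallel(roots, "Mongolic")
# List languages from the highest to the lowest share of Mongolic parallels.
for lang, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {share:.0%} of attested roots have a Mongolic parallel")
```

With real data, a clearly higher share for Manchu than for Negidal would be consistent with the loanword scenario, though telling it apart from chance would still need a proper significance test on much larger counts than this.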