Phonotactic analysis is probably one of the most straightforward tools for statistical etymology. There are others too, but this is an analysis method that will easily bring up a wealth of data that has no real synchronic motivation (arbitrariness of the sign, once again) yet can be assumed to reflect all sorts of historical processes of language development, though usually in more or less fossilized form, perhaps even quite deeply so.
However, when the object of the analysis is a reconstructed protolanguage, another option also becomes available: to take significant quirks as instead suggesting points on which the reconstruction itself could be improved. A reconstruction is not primary data! We are allowed to make argued-for adjustments in just what the reconstruction is in the first place. (Alas, not realizing this is a somewhat common failure mode in studies mixing synchronic analysis methods with reconstructed data.)
For an example of this approach in action, here is a sneak peek at one dataset I am massaging:
This table shows the co-occurrence of initial consonants and following vowels in the common Ob-Ugric lexicon, as reconstructed by Honti (1982). Since this is for the sake of an example, at this point only some small adjustments to the reconstruction have been made, nothing major. The various non-integer values are due to me splitting most reconstructions that show uncertainty: e.g. the root listed as *keej-/*kööj- ‘to lek’ has been tabulated as 0.5 *kee-, 0.5 *köö-. An exception to this, though, is the correspondence type marked by Honti as “uu/ïï”, which actually outnumbers several allegedly regular vowel correspondences, and seems to deserve a line of its own.
“B”, “BB” and “FF” moreover indicate correspondences that are sufficiently irregular that Honti has only dared to report whether the data points towards a back or a front vowel, and a long or a short one.
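The fractional tabulation described above can be sketched roughly as follows. This is only an illustration: the function name `tabulate` and the data layout are my own assumptions, and splitting the weight equally among more than two alternatives is a generalization of the 0.5/0.5 case mentioned above.

```python
from collections import defaultdict

def tabulate(roots):
    """Count (onset, vowel) pairs, splitting uncertain roots evenly.

    `roots` is a list of roots; each root is a list of alternative
    (onset, vowel) readings. A root with n alternatives contributes
    1/n to each of them (assumption: equal weights).
    """
    counts = defaultdict(float)
    for alternatives in roots:
        weight = 1.0 / len(alternatives)
        for onset, vowel in alternatives:
            counts[(onset, vowel)] += weight
    return dict(counts)

# A tiny sample using roots cited in this post.
roots = [
    [("k", "ee"), ("k", "öö")],  # *keej-/*kööj- 'to lek'
    [("ɬ", "ää")],               # *ɬääpət '7'
    [("ɬ", "ää")],               # *ɬäärəɣ 'ruffe'
    [("ɬ", "ää")],               # *ɬäärɣət 'hard'
]
counts = tabulate(roots)
```

Each root still contributes exactly 1 to the grand total, so the row and column sums remain comparable with the number of roots.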
So the question is: might we be able to determine if there is anything odd going on here? For just one example, while roots with zero onset are quite abundant, there seems to be an absence of roots beginning with *o-. But then again, random holes occur elsewhere in the table as well. So is this a sign of something being wrong with the reconstruction? A reflection of some earlier sound law in the development of Ob-Ugric? Or perhaps of nothing at all? Hard to say using only qualitative tools.
Forming some simple quantitative predictions from this type of data is however not hard. For a first approximation, say we assumed a fully random distribution of roots, with no interdependences in the occurrence of consonants vs. vowels. In this situation, the expected number of roots beginning with a given *CV- sequence could be calculated from just the total vowel and consonant frequencies. For example *-ää- occurs in 44/724 ≈ 6.1% of the roots; *ɬ- occurs in 53/724 ≈ 7.3% of the roots; their predicted co-occurrence is thus 0.061·0.073 ≈ 0.44% of the roots, i.e. the expectation value of roots beginning with *ɬää- is about 3.2.
Algebraically, the formula for this expectation value comes out as C·V/A, where C is the attested count of the onset, V the attested count of the vowel, and A the number of roots altogether.
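As a quick check of the arithmetic, the formula can be written out in a few lines. The function name `expected_count` is just an illustrative choice; the figures are the ones quoted above for *ɬ- and *-ää-.

```python
def expected_count(onset_total, vowel_total, all_roots):
    """Expected number of roots with a given *CV- under independence: C*V/A."""
    return onset_total * vowel_total / all_roots

# *ɬ- occurs in 53 roots, *-ää- in 44 roots, out of 724 roots altogether.
expected = expected_count(53, 44, 724)
print(round(expected, 1))
```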
The actual number of attested roots beginning with *ɬää- indeed happens to be 3 (*ɬääpət ‘7’, *ɬäärəɣ ‘ruffe’, *ɬäärɣət ‘hard’). So in this case the prediction is spot on! Many of the other CV combinations seem to work this well too, “off” by about 1 at most. But larger deviations can also be found. Here is the full table of differences between the attested and expectation values, with some color-coding applied:
As an initial observation, note the gradual accumulation of random holes and peaks: fewer combinations are off by about 2, even fewer off by about 3, etc. Also unsurprisingly, bigger deviations are mainly found towards the upper left, where the data is denser.
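A difference table of this kind can be computed along the following lines. This is a minimal sketch, not the spreadsheet I actually used: the function name `deviations` is hypothetical, and the toy counts use made-up segments rather than Honti's data.

```python
from collections import defaultdict

def deviations(counts):
    """Return attested minus expected counts for every onset-vowel cell.

    `counts` maps (onset, vowel) to an attested (possibly fractional) count.
    Expected values follow C*V/A, computed from the row and column totals.
    """
    onset_totals = defaultdict(float)
    vowel_totals = defaultdict(float)
    for (onset, vowel), n in counts.items():
        onset_totals[onset] += n
        vowel_totals[vowel] += n
    total = sum(counts.values())

    table = {}
    for onset in onset_totals:
        for vowel in vowel_totals:
            expected = onset_totals[onset] * vowel_totals[vowel] / total
            # Unattested combinations count as 0, i.e. a "hole" in the table.
            table[(onset, vowel)] = counts.get((onset, vowel), 0.0) - expected
    return table

# Toy data (invented for illustration): "ta" is an unattested combination.
toy_counts = {("k", "a"): 3, ("k", "u"): 1, ("t", "u"): 4}
dev = deviations(toy_counts)
```

By construction the deviations in every row and every column sum to zero, so each peak is balanced by holes elsewhere in the same row and column.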
At this point we could continue quantitative analysis. Making various starting assumptions about expected variance in the vocabulary and then doing a whole bunch of math would probably be able to tell us if the general patterning of the data shows statistically significant deviations or not. But… this seems like a bit too much work. For one, parts of the table would end up having to be recalculated if we were to adjust the underlying reconstruction even just a bit (e.g. by splitting a given proto-vowel in two). And for two, it is not at all obvious what should be our default hypothesis! It is already known that languages tend to prefer some phoneme combinations over others. And yet, AFAIK, a universal typology of this has yet to be developed even qualitatively. Applying detailed rigorous methodology while relying on guesstimated background assumptions would be a waste of effort.
Instead, I think at this point a qualitative human intervention can already tell us how likely it is that there is anything interesting going on here at all. Rather than aiming for assessing every single entry, let’s check out just the lowest-hanging fruit. The 5 most aberrant *CV- sequences in the data are:
- *wuu: +9.0
- *kuu: +7.4
- *ää: +6.9
- *mee: +5.2
- *kää: -5.0
Since my initial point is to demonstrate that calculating phoneme co-occurrence rates among a proto-language’s lexicon can reveal evidence for adjusting the reconstruction, then surely this sort of evidence should be found in this end of the data, if at all.
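Picking out the most aberrant sequences amounts to sorting the difference table by absolute value. A small sketch, with `most_aberrant` as a hypothetical helper name; the sample values are the five listed above plus *ɬää (3 attested against about 3.2 expected) for contrast:

```python
def most_aberrant(deviation_table, n=5):
    """Return the n entries with the largest absolute deviation, biggest first."""
    ranked = sorted(deviation_table.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]

reported = {
    "*ɬää": -0.2,
    "*kää": -5.0,
    "*mee": 5.2,
    "*wuu": 9.0,
    "*ää": 6.9,
    "*kuu": 7.4,
}
top = most_aberrant(reported)
```

Sorting on the absolute value keeps underrepresented sequences such as *kää- in the running alongside the overrepresented ones.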
And indeed, it looks like at least the first case is not an accident. In part it probably reflects the fact that the contrast between *uu- and *wuu- is not very clearly indicated in the data at all. Most Ob-Ugric varieties have lost *w before rounded vowels; and some others like Pelymka Mansi and Kazym Khanty have by contrast introduced an epenthetic *w before some rounded vowels. In other words, we may already suspect that having as many as nine roots “too many” indicates that some of Honti’s *wuu- roots here should be actually reconstructed with plain *uu- instead.
A look at Southern Mansi suggests a few good candidates. These are the words where Honti assumes shortening *uu > *u in Mansi (although this is a change he does not really present any conditioning for):
- #668 *wuuj- >> SMs oj- ‘to swim’ (~ Pelymka wuj-, Kazym wooś-)
- #682 *wuulɜ >> SMs olā ‘pole’ (~ Pelymka wula, Kazym wooɭ)
- #689 *wuunč- >> SMs onš- ‘to run over’ (~ Pelymka wunš-, Kazym wuš-)
- #708 *wuur >> SMs or ‘edge’ (~ Kazym wur)
- but: #706 *wuur >> SMs wor ‘possibility, way’ (~ Kazym wur)
It looks like Southern Mansi may actually have maintained a contrast between *w and zero in this environment. And, better yet: Honti also fails to list any examples beginning with (zero onset plus) *uu that would have any potentially incriminating reflexes in Pelymka, Kazym, or other similar dialects. So there seems to be no obstacle to adjusting the reconstructions to *uuj- ‘to swim’, *uulɜ ‘pole’, *uunč- ‘to run over’, *uur ‘edge’. In the case of ‘to swim’ we can even verify this with external evidence! Consider Permic *uj- ‘to swim’. Normally Permic should retain evidence of *w even before rounded vowels (as in Finnish uusi, Hungarian új ~ Komi выль /vɨlʲ/ ‘new’), but no such thing appears here.
Recognizing w-epenthesis also allows cleaning up #701 *wuupɜ ‘older sister’, where *w seems to have again been posited only on the basis of Pelymka wuup. The Khanty reflexes like Tremjugan oopïï, Kazym opi, Obdorsk apii do not support positing *w- at all. Neither does the Proto-Samoyedic cognate *apå. By external evidence, #688 *wuunč ‘nelma’ (a type of salmon) similarly seems to be a case of secondary *w: contrast Proto-Samoyedic *ånčɜ, Komi удж /udž/.
Moreover, the above type of scenario is not the only possible kind of explanation for why a particular sound sequence might be non-randomly overrepresented. A different issue seems to concern the following two words:
- #659 *wuuč ‘town’
- #660 *wuučəm ‘weir’
Wider Uralic etymological references generally consider these words to be based on one and the same root. Cognates such as Northern Sami oahci ‘barrier, obstacle, reef’ or Tundra Nenets ва” /waːʔ/ ‘fence’ seem to point to the original basic meaning having been simply ‘fence, obstacle’, from which the two attested meanings are easily derivable. Perhaps #657 *wuuč- ‘to fish’ is also a part of the same bundle. Honti indeed even includes small footnotes in the lexicon commenting on the possible relationship of these three words. It’s not clear to me why he nevertheless lists them separately.
Altogether at least eight of the roots where Honti reconstructs *wuu- seem to be superfluous in some sense. A pretty good catch for such a simple statistical tool, so far.
I’ve only taken a more casual look at the other top-5 cases, but some instances of *kuu- also might be illusory. More briefly:
- #229 *kuuďmɜ ‘ashes’: according to a recent proposal from Ante Aikio, this would be a derivative of the root listed by Honti as #227 *kuuď-/*kïïď- ‘to disappear’.
- #261 *kuulpɜ ‘net’ is generally considered an old derivative of #245 *kuul ‘fish’.
Some less directly apparent phenomena may also have shaped the data. For one, I have here only charted out the co-occurrences of initial consonants + initial vowels. Perhaps a look at medial consonants, or the few stem vowels that are found in the data, would turn up other results. In theory it is even possible that some initial *CV- effects are the secondary product of sound changes involving medials instead. Suppose initial X had some interaction with medial Z, and this then had some interaction with vowel Y; this would already suffice to generate a correlation in some direction between X and Y. Hence, with this mode of analysis, it seems efficient to attack the data from multiple directions. Take a couple of snapshots from different angles, look thru the biggest problems that come up, recalculate the results after any adjustments… and see if this then brings any new issues to light.