My previous example of phonotactic combination analysis was on data that was, despite a few kinks, still largely homogeneous. But to show why it’s important to have a decent basic hypothesis before going into more fine-grained analysis, here’s a look at a rather different dataset. These are the medial consonants and consonant clusters from the inherited Proto-Khanty lexicon, again per Honti’s data (words with cognates elsewhere in Uralic but absent from Mansi are not included).
Some notes on notation first, though.
- 1st medial consonants (“C₂”) are listed down the rows. Possible 2nd consonants of a consonant cluster (“C₃”) are listed across the columns.
- I have analyzed PKh *ə as an epenthetic, non-phonemic segment that is inserted in “difficult” consonant clusters in, roughly speaking, stem-final position. E.g. *peLəm ‘lip’ = underlyingly /peLm/. Without this analysis I would be almost comically short on data.
- *g and *x mark two segments that only contrast in Western Khanty in back-vocalic roots (as /w/ versus /χ/). Honti conflates both as *ɣ. The contrast is not (directly?) recoverable in front-vocalic roots, nor in words that have been retained only in Eastern Khanty, and seems to have been absent from the C₃ position. I have counted ambiguous cases under *g¹.
- *L and *Ľ are cover symbols for laterals. PKh had a contrast between a fricative *ɬ and an approximant *l, and might even have had a similar contrast among the palatal laterals, but this is not recoverable in the medial position. (By contrast, the retroflex lateral *ɭ was quite certainly an approximant.)
But without further delay, here is what things look like in this part of the word root — sorted by frequency, again:
A single look at this table should already tell us, though, that it would be pointless to compare it against what an assumption of random distribution would predict. Not only are there way too many gaps, there are also several strong correlations apparent. Take for example C₂ *ń and C₃ *ć, which are both found almost solely in the cluster *ńć.
So the first step ought to be determining some basic background rules of phonotactics. Here is the same data, now sorted by place of articulation instead:
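For reference, this is roughly what “what random distribution would predict” amounts to: the expected count of a cluster is the product of its row and column totals, divided by the grand total. The sketch below is only an illustration; the function name `expected_counts` is my own, and the counts in `toy` are made-up placeholders, not figures from Honti’s data.

```python
from collections import Counter

def expected_counts(observed):
    """Expected cluster counts if C2 and C3 combined independently."""
    total = sum(observed.values())
    row = Counter()  # how often each consonant occurs as C2
    col = Counter()  # how often each consonant occurs as C3
    for (c2, c3), n in observed.items():
        row[c2] += n
        col[c3] += n
    # expected = row total * column total / grand total
    return {(c2, c3): row[c2] * col[c3] / total for c2 in row for c3 in col}

# Placeholder counts for illustration only; NOT figures from Honti's data.
toy = {("ń", "ć"): 8, ("m", "t"): 6, ("ŋ", "k"): 6}
exp = expected_counts(toy)
print(exp[("ń", "ć")])  # 8 * 8 / 20 = 3.2 expected, against 8 observed
print(exp[("ń", "t")])  # 8 * 6 / 20 = 2.4 expected, against 0 observed
```

When two consonants occur almost only with each other, as in the toy *ńć row, the independence baseline underpredicts their cluster and overpredicts every other cell in that row and column, which is why the comparison tells us little until the structural gaps are accounted for.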
Several qualitative patterns are clear by now.
- Almost all of the action goes on in the “edge” cells — those combining peripheral (bilabial/velar) and coronal (dental/alveolar/retroflex/palatal) consonants.
- Nasal + stop/affricate clusters (highlighted in pink) are easily the most frequent type of homorganic clusters. For bilabials, palatals and velars they are the only attested cases.
- There is a degree of coronal harmony: dentals/alveolars, retroflexes, and palatals do not combine with one another. For the sibilants, nasals and laterals, this is exceptionless. The rhotic *r and the semivowel *j tolerate some exceptions, perhaps because the two lack counterparts at other POAs. One case with *-ćt- is attested, namely *kaćtə- ‘to hit’, and actually in Northern Khanty only. This is also one of the clusters that are demonstrably secondary, as comparison to Mansi *këëćk- indicates that the word is to be segmented as *kać-tə-. Perhaps we can assume that in Proto-Khanty, this cluster still remained impossible.
- Geminates are uniformly forbidden.
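The patterns above can be turned into a simple cell classifier, which is handy for deciding which parts of the table are worth a frequency analysis at all. A minimal sketch follows; the `PLACE` table is a partial inventory assumed for illustration (*r, *j, *g and *x are left out on purpose), and the category labels are my own.

```python
# Assumed place-of-articulation classes; a partial, illustrative inventory.
PLACE = {
    "p": "bilabial", "m": "bilabial",
    "t": "dental", "s": "dental", "n": "dental", "L": "dental",
    "č": "retroflex", "ɳ": "retroflex", "ɭ": "retroflex",
    "ć": "palatal", "ń": "palatal", "Ľ": "palatal",
    "k": "velar", "ŋ": "velar",
}
CORONAL = {"dental", "retroflex", "palatal"}

def classify(c2, c3):
    """Label a C2 + C3 cluster by the qualitative patterns listed above."""
    if c2 == c3:
        return "geminate"
    p2, p3 = PLACE[c2], PLACE[c3]
    if p2 == p3:
        return "homorganic"
    if p2 in CORONAL and p3 in CORONAL:
        return "mixed coronal"      # would violate coronal harmony
    if p2 not in CORONAL and p3 not in CORONAL:
        return "mixed peripheral"
    return "edge"                   # peripheral + coronal, in either order

print(classify("ń", "ć"))  # homorganic
print(classify("ć", "t"))  # mixed coronal
print(classify("m", "t"))  # edge
```

Running every attested cluster through something like this makes exceptions such as *-ćt- fall out automatically as “mixed coronal” cases to be checked by hand.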
More detailed frequency analysis should probably focus just on the areas that show no obvious restrictions of this kind. And now we can easily pick out a subset of data suited for this:
The data’s still a bit scarce, but here the distribution is at least closer to random, so signs of various “minor” historical developments are able to stand out better. Plus: note that despite my presentation, this is not really two separate datasets; it’s a single, three-dimensional dataset, with cluster order as the 3rd dimension. We can for example note the disproportionately high count of *-x(ə)L- compared to a disproportionately low count of *-L(ə)g-, almost certainly an indication of the regular metathesis of PU *lk and *sk in Khanty.
A full analysis would again be much more work than I am going to just blog out in my free time, though. I have no doubt that this general type of methodology, applied to any one given language, could produce a small monograph’s worth of results…
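Treating cluster order as a dimension of its own can be sketched like this: fold each cluster and its mirror image into one unordered pair, and keep the two orders as separate counts. This is only an illustration. The helper name `pair_order_counts` is hypothetical, *g and *x are merged under Honti’s *ɣ since the contrast seems absent in C₃, and the numbers in `toy` are placeholders rather than real counts.

```python
# *g and *x are merged, since the contrast seems to be absent in C3 position.
MERGE = {"g": "ɣ", "x": "ɣ"}

def pair_order_counts(counts):
    """Map each unordered consonant pair to (count in key order, count reversed)."""
    out = {}
    for (c2, c3), n in counts.items():
        a, b = MERGE.get(c2, c2), MERGE.get(c3, c3)
        key = tuple(sorted((a, b)))  # order-independent label for the pair
        fwd, rev = out.get(key, (0, 0))
        if (a, b) == key:
            fwd += n
        else:
            rev += n
        out[key] = (fwd, rev)
    return out

# Placeholder counts for illustration only; NOT figures from Honti's data.
toy = {("x", "L"): 7, ("g", "L"): 2, ("L", "g"): 1}
print(pair_order_counts(toy))  # {('L', 'ɣ'): (1, 9)}
```

A strongly lopsided pair like the toy one here is the kind of signal that would point to a metathesis, as opposed to a cluster that is simply rare in both orders.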
A result very similar to this was already noted by Eugene Helimski in 2002: an incompatibility of the dentals *n *t vs. retroflexes *ɳ *č in word-initial vs. word-medial position. See: “Eine Regel der Konsonantenkompatibilität im Ostjakischen”, in Veröffentlichungen der Societas Uralo-Altaica 57.
It is obvious that there were no restrictions on initial palatals though, as shown by e.g. *ńoL ‘nose’, *ńoLt- ‘to knead’, *ńeeL- ‘to swallow’, *ńuuɭəm ‘wound’, *ńaLkïï ‘Siberian fir’, *ńaaL ‘arrow’, *ńeLää ‘four’…