Some thinking out loud on the formalization of comparative and historical phonology.
As in most work I’ve seen on the topic, I presume that an etymological corpus of word comparisons has already been given, and that it has also been aligned segmentwise. The usual question at this point is how to proceed with reconstruction. I however largely assume even this as given. The main questions I would ask are: how much should we trust a reconstruction given for the data? How coherent is it internally to begin with, and how does it match against other reconstruction possibilities?
This is not a very relevant question for developing automatic reconstruction methods, but a better understanding of these issues will be practical in assessing existing proposals, especially the ones that cover substantial amounts of data but are nevertheless disputed on every front, e.g. any variant of Altaic or Nostratic.
Foundations and Cores
The basic concepts of this post:
- A phonological foundation is a set of word comparisons where every sound correspondence is regular within the set.
- A phonological core is a minimal phonological foundation, i.e. a phonological foundation such that no strict subset of its word comparisons is a phonological foundation anymore.
Note that these definitions are only with respect to etyma, not with respect to the number of reflexes. A comparison of two-reflex etyma could be exactly as regular as a comparison of ten-reflex ones; a compact foundation comparing only two languages could be exactly as regular as a diffuse foundation comparing ten languages. For now, all correspondences still have to be regular between all applicable language pairs specifically.
These concepts have been phrased purely in terms of sound correspondences. Actual reconstruction requires consideration as well, though. A few initial definitions for this:
- A reconstruction is a set of word comparisons between at least three languages, with exactly one of them being a special type of language called a proto-language, with the following properties:
- Every comparison includes a proto-language cognate (called a proto-form).
- The proto-language is not given by external data, but can be adjusted at will.
(I.e. this is the “operational” proto-language, not the inferred “real historical” proto-language. By this definition, Latin is not Proto-Romance, at most identical to Proto-Romance.)
- A historical phonology is a partially ordered set of sound changes (I will not go here into rigorously defining a sound change) with the following properties:
- Sound changes are ordered with respect to one another only if they interact ≈ roughly: take the same segment as input or as conditioning.
(I.e. we abstract away the difference between historical phonologies that differ only in the relative chronology of changes that do not interact. In the absence of other details, *śëta > *śata > sata and *śëta > *sëta > sata should be considered identical histories.)
- The bottommost sound changes start from the proto-language.
- The topmost sound changes yield the other languages as recorded in real data.
- For any sound change applying to all languages, there is at least one sound change postdating it that does not apply to all languages.
(I.e. the proto-language is still indeed the last common ancestor, not merely any common ancestor.)
The latter could be better called a “comparative historical phonology”… since real historical phonologies often also take loanword evidence into account when establishing relative chronologies. And we could also define internally reconstructed historical phonologies that replace condition 4 with a redefinition of the proto-language. Gotta learn to walk before running, though.
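The point about non-interacting changes can be checked mechanically. The following is only a minimal sketch: the two changes from the *śëta example are treated as unconditioned string replacements (a simplification), and the second pair of changes, like the form "pata", is a hypothetical stand-in for changes that do interact.

```python
from itertools import permutations

# Simplified, unconditioned stand-ins for the two changes in the *śëta example.
def depalatalize(word):      # *ś > s
    return word.replace("ś", "s")

def back_e_to_a(word):       # *ë > a
    return word.replace("ë", "a")

# Hypothetical interacting pair: the output of one change feeds the other.
def a_to_o(word):
    return word.replace("a", "o")

def o_to_u(word):
    return word.replace("o", "u")

def outcomes(proto, changes):
    """Set of results obtained by applying the changes in every possible order."""
    results = set()
    for order in permutations(changes):
        word = proto
        for change in order:
            word = change(word)
        results.add(word)
    return results

print(outcomes("śëta", [depalatalize, back_e_to_a]))  # a single outcome
print(outcomes("pata", [a_to_o, o_to_u]))             # two different outcomes
```

When every ordering gives the same outcome, nothing is lost by leaving the pair unordered, which is what the partial order encodes. When the outcomes differ, the relative chronology has to be part of the historical phonology.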
I have on purpose defined these two concepts without referencing the concepts in my first list. It is more profitable to instead treat these as orthogonal, and to speak of concepts such as a foundational reconstruction = a reconstruction whose underlying set of word comparisons, the proto-language excluded, is a phonological foundation. There are many proposed reconstructions, and we do not want to suggest that they are by definition regular in my highly formal sense! As seen below, perhaps they do not even need to be.
Some concepts in hand, let us now go over a simple example. One clean phonological core within Uralic is presented by the following four etymologies between Finnish, Northern Sami and Erzya:
- Fi. kesä ~ NS geassi ~ Er. кизэ /kize/ ‘summer’
- Fi. pesä ~ NS beassi ~ Er. пизэ /pize/ ‘nest’
- Fi. kala ~ NS guolli ~ Er. кал /kal/ ‘fish’
- Fi. pala ~ NS buolli ~ Er. пал /pal/ ‘bit’
(I have stuck here to the most widely spoken members of their subfamilies. The data could be easily also stretched to include further varieties, or rewritten as a comparison of Proto-Finnic, Proto-Samic and Proto-Mordvinic.)
We can easily see that everything is regular: every sound correspondence occurs at least twice — in fact exactly twice; cores with correspondences occurring thrice could only be put together from more holey data. Reconstructions would be easy to suggest too. A phonetically simple approach would be e.g. *kesä, *pesä, *kală and *pală, which is only mildly off from the usual thinking. 
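The regularity claim can be verified by brute force. This is a minimal sketch, assuming that "regular" means "occurs at least twice" (the loose criterion used so far), that correspondences are counted per language pair without regard to position, and that the segmentation below (with Erzya in transcription and "" for a gap) is an acceptable alignment. The function names are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Each comparison maps a language to its aligned segments ("" marks a gap).
# The segmentation is an assumption made for illustration.
CORE = {
    "summer": {"Fi": ["k", "e", "s", "ä"], "NS": ["g", "ea", "ss", "i"], "Er": ["k", "i", "z", "e"]},
    "nest":   {"Fi": ["p", "e", "s", "ä"], "NS": ["b", "ea", "ss", "i"], "Er": ["p", "i", "z", "e"]},
    "fish":   {"Fi": ["k", "a", "l", "a"], "NS": ["g", "uo", "ll", "i"], "Er": ["k", "a", "l", ""]},
    "bit":    {"Fi": ["p", "a", "l", "a"], "NS": ["b", "uo", "ll", "i"], "Er": ["p", "a", "l", ""]},
}

def correspondences(comparison):
    """Yield every pairwise correspondence (lang1, seg1, lang2, seg2) of one comparison."""
    for l1, l2 in combinations(sorted(comparison), 2):
        for s1, s2 in zip(comparison[l1], comparison[l2]):
            yield (l1, s1, l2, s2)

def is_foundation(comparisons):
    """True if the set is non-empty and every pairwise correspondence occurs at least twice."""
    counts = Counter(c for comp in comparisons.values() for c in correspondences(comp))
    return bool(counts) and min(counts.values()) >= 2

def is_core(comparisons):
    """True if the set is a foundation and no non-empty strict subset is one."""
    if not is_foundation(comparisons):
        return False
    keys = list(comparisons)
    for size in range(1, len(keys)):
        for subset in combinations(keys, size):
            if is_foundation({k: comparisons[k] for k in subset}):
                return False
    return True

print(is_foundation(CORE), is_core(CORE))
```

The subset search is exponential in the number of comparisons, so this only suits small candidate cores. Dropping any one of the four etymologies leaves some onset or rime correspondence attested only once, which is why no smaller subset qualifies.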
However, the data here in effect only allows reconstructing two onsets *k-, *p- and two rimes that I have just called *-esä, *-ală. It does not establish any contrast between the individual segments in the rimes! This means that given just this data, we could also rewrite the rimes in more minimal forms such as *-ele, *-ale, and assume a number of conditional sound changes that apply in all or most descendants (e.g. *l > ⁽*⁾s / e_ in all, *e > a / aC_ in Finnish).
This hence already demonstrates that reconstructions should not be built up from “core reconstructions”: overly limited data leads to overly minimal reconstructions. A four-comparison core is not quite the smallest possible, but obviously most of a realistic proto-language still cannot fit into one. Reconstructions with real phonological labels should probably wait until we have assembled larger phonological foundations; within cores, this work is adequately substituted just by the correspondence patterns themselves.
This phonological core is incidentally also a “semantic core”, with each of the four comparisons showing the exact same meaning in every language. This is probably also a desirable trait in phonological foundations in general, but then not strictly required by the phonological formal side of the Comparative Method.
Using the concepts of phonological foundations and cores, I can now also define a few categories of word comparisons:
- A word comparison that belongs to at least one phonological foundation is regular.
- A core comparison is a word comparison that belongs to at least one phonological core.
- A regular adduct is a word comparison that belongs to at least one phonological foundation, but does not belong to any phonological core (is not a core comparison).
- A word comparison that shows regular sound correspondences as established by a phonological foundation, except for one unique sound correspondence, is near-regular.
- Given a reconstruction, a single near-regular comparison that does not contradict any soundlaw establishable from the reconstruction without this item is (phonologically) nonprovable; as is the new, once-exemplified soundlaw it requires.
- Given a reconstruction, a near-regular word comparison that does contradict a soundlaw establishable from the reconstruction without this item (would necessitate setting up a new proto-phoneme) is an exception.
- If there are two or more nonprovable comparisons, such that they are compatible with the same foundation, but only one of them can be added to the foundation as nonprovable (forcing the others to be exceptions), they are competing.
- A word comparison more irregular than a near-regular one (with at least two correspondences that are not regular) is simply irregular. We could distinguish further categories such as “2-irregular”, “3-irregular” etc. (with “1-irregular” being what I have just titled “near-regular”), but in practice the case seems to be that, for any sensible morpheme length, already 2-irregular correspondences are too weak to be very useful for linguistic reconstruction at all.
The first and third points of case #2 may sound confusing; in practice it means simply the case where we do not have enough data to establish what the regular reflex of a proto-phoneme *X in language L might be.
Note that a comparison may be at one stage with respect to one phonological foundation, at another with respect to another.
Continuing the previous example, my above-detailed core motivates hunting for other words displaying the same correspondences: Fi. p- ~ NS b- ~ Er. п-, Fi. -a- ~ NS -uo- ~ Er. -а-, etc. The current core is “closed” in the sense that adducing any one additional and different comparison from among the known (West) Uralic comparative material cannot produce a new foundation. Any new item will have to remain nonprovable until at least one other item has also been added. Purely in theory, it could accept a comparison such as Fi. kala ~ NS gilluo, mixing sound correspondences from different “slots”, which would perhaps prompt reconstructing something like *kålä for ‘fish’, *kälå for this. However, across the Uralic languages it happens to be the case that every complete sound correspondence (a correspondence pattern; see below) is strictly restricted to a particular position in the word. 
To pick out one new datapoint: Fi. vala ‘oath’ ~ Er. вал /val/ ‘word’ (with also Sami cognates, not found in NS though) can be seen to be regular save for the initial, i.e. it is nonprovable with respect to this core. Minimally it could be proven to be regular by also identifying a West Uralic **vesä. No such comparison is known though (and indeed no words ˣvesä, ˣveassi, ˣвизэ exist in any meaning at all in the three languages); hence more data still is required. One way to do it would be to adduce also the following two comparisons:
- Fi. kesi ~ Er. кедь /keď/ ‘skin’
- Fi. vesi ~ Er. ведь /veď/ ‘water’
These two allow reconstructing a third rime *-eti; vala ~ вал and vesi ~ ведь allow reconstructing a third onset *v-; and the rime *-ală and the onset *k- we knew already. Hence all is again in order. Notice that by now we have not only proven vala ~ вал to be regular: it is indeed as much as a core comparison, since also the set *kală, *vală, *keti, *veti constitutes a core!
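The steps of this example can be traced in a few lines. This is a sketch restricted to the Finnish–Erzya pairs, again assuming "regular" means "attested at least twice" and using an illustrative segmentation with "" for a gap. `irregular_count` is a hypothetical helper name for the "n-irregular" grading.

```python
from collections import Counter

# Finnish–Erzya pairs, aligned slot by slot ("" marks a gap); segmentation is assumed.
FIRST_CORE = [
    (["k", "e", "s", "ä"], ["k", "i", "z", "e"]),  # kesä ~ kize
    (["p", "e", "s", "ä"], ["p", "i", "z", "e"]),  # pesä ~ pize
    (["k", "a", "l", "a"], ["k", "a", "l", ""]),   # kala ~ kal
    (["p", "a", "l", "a"], ["p", "a", "l", ""]),   # pala ~ pal
]
KALA = FIRST_CORE[2]
VALA = (["v", "a", "l", "a"], ["v", "a", "l", ""])   # vala ~ val
KESI = (["k", "e", "s", "i"], ["k", "e", "ď", ""])   # kesi ~ keď
VESI = (["v", "e", "s", "i"], ["v", "e", "ď", ""])   # vesi ~ veď

def counts(pairs):
    """Count each (Finnish segment, Erzya segment) correspondence across the pairs."""
    return Counter(c for fi, er in pairs for c in zip(fi, er))

def irregular_count(candidate, foundation):
    """Number of the candidate's correspondences that the foundation does not show."""
    known = counts(foundation)
    return sum(1 for c in zip(*candidate) if known[c] == 0)

def is_foundation(pairs):
    c = counts(pairs)
    return bool(c) and min(c.values()) >= 2

print(irregular_count(VALA, FIRST_CORE))         # only v ~ v is new: near-regular
print(is_foundation(FIRST_CORE + [VALA]))        # one extra item is not enough
print(is_foundation([KALA, VALA, KESI, VESI]))   # the second core
```

Adding all three new items to the first core also yields a foundation under this criterion, with Finnish e and s each now split across two Erzya reflexes.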
As noted above, this new second core also only works between Finnish and Erzya. In any Sami variety, clear cognates exist only for *vală and *keti. They will remain nonprovable until we have adduced even more evidence to establish the regular Samic development of *-eti rimes, or more generally *-eCi rimes, and of *v-. However, forms like Southern Sami vuelie /vʉelie/ ‘a joik’ or Skolt Sami -kõtt ‘skin’ regardless already suggest what to look for. This evidence is not hard to locate either, e.g. in Fi. veri ~ NS varra (SS vïrre, SkS võrr) ~ Er. верь /veŕ/ ‘blood’. Though this in turn will send us on a lookaround to establish a few other things as regular, most prominently the development of *r in all involved languages and of *t in Sami; secondly also to verify the correspondences v- ~ v- ~ v- and ï-e ~ a-a ~ õ-∅ between our three Sami varieties that have come up so far. All doable too of course. What we can however see already well enough is how extending foundations is not a question of linear progress.
A methodological problem that emerges here, once variable amounts of languages per etymology are being compared, is that regularity for every pairwise comparison may be too much to demand. If e.g. between Finnish and Hungarian there is insufficient evidence to establish a correspondence such as s ~ gy as regular at all (the only example of this is ‘urine’: kusi ~ húgy < PU *kuńćə), but at the same time s ~ /ź/ and gy ~ /ź/ can be both established in comparison with Komi (or s ~ /ńś/ and gy ~ /ńś/ in comparison with Mansi, etc.) — is this not good enough? It feels to me that we should not have to choose between either Finnish, Hungarian, or the still quite regular ‘urine’ etymon to include in a good phonological foundation of Uralic.
Lone pairwise comparison is not good enough for everything, on the other hand. This would make it much too easy to set up some “straggling” members in datasets, such that Chuvash maybe has regular correspondences with Mari but not with the rest of Uralic.
Any pairwise segment comparison still only either is or isn’t regular, and I’ve already defined grades of regularity for an individual pairwise word comparison as well. Even further grades of regularity can regardless be defined, first for individual segments as considered across the entire dataset:
- Complete regularity: a segment whose every pairwise correspondence across a foundation is regular.
- Biconnected regularity: a segment whose graph of pairwise regular correspondences across a foundation is a biconnected graph (cannot be split into two independent graphs by the removal of one “bridging” language from comparison).
- Connected regularity: a segment whose graph of pairwise regular correspondences across a foundation is a connected graph.
- Soundlawful regularity: a segment whose every correspondence with the proto-language (hence in a reconstruction, not just any foundation) is regular.
Anything less than #4 is obviously no longer regular at all, and instead at most semiregular: since we can attempt to provide a proto-form for every word comparison, gaps cannot be a problem for soundlawfulness.
#4 is, in fact, weaker than even #3. Assume that we had only two examples of the development of *k > /k/ in a poorly known (or just heavily divergent) language such as Muromian. This could arguably suffice to establish the reflex as regular. But then if these two etyma had no overlap in what other languages they have reflexes in, they would not establish any regular correspondence between Muromian /k/ and any other attested Uralic language. I believe it is even possible to create, between no more than three languages, a highly degenerate counterexample dataset that is soundlawfully regular but none of the pairwise sound correspondences are.
Another way to create a highly degenerate but soundlawfully regular dataset is to simply pool together two disjoint foundations — say, with data comparing Mari and Chuvash as one component, data comparing French and English as another. This would still suffice to show that *k- > /k/ is a regular sound change in each language (just not that it is the same *k in all cases…). This is clearly absurd as one proto-language though. I suppose global connectedness regardless of regularity should be required anyway.
In large datasets further grades between #1 and #2 could also prove useful. I do not have the intuition to immediately identify them, though (“triconnected” comes to mind as a naive proposal, but whether it would really improve anything much is not clear to me). Before that though, we can consider what are sensible options for datasets with only a small number of languages. For two languages, soundlawful regularity already equals complete regularity. For three, biconnectedness does the same. For four, a double triangle graph is a possible more than biconnected but not fully connected option. But then it’s also already vulnerable to stragglers: e.g. we can find regular correspondences between Swedish, Finnish, Karelian and Erzya, just not between Swedish and Erzya specifically. More Finnic languages could also be added into the mix to create even more highly connected correspondence graphs that still have the same problem. To eliminate this problem, but not the case of *ńć in Finnish vs. Hungarian, perhaps it suffices to demand the existence of some regular correspondences between all pairs of languages (even if not all pairs of segments).
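The three graph-based grades can be told apart with a small amount of code. This is a sketch only: it takes as given a list of language pairs between which a segment's correspondence counts as regular, and it does not model grade #4, which needs a reconstruction. The function names are illustrative.

```python
from itertools import combinations

def reachable(nodes, edges, start):
    """Return the set of nodes reachable from start over undirected edges."""
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            nxt = b if a == cur else a if b == cur else None
            if nxt in nodes and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_connected(nodes, edges):
    nodes = set(nodes)
    return not nodes or reachable(nodes, edges, next(iter(nodes))) == nodes

def grade(languages, regular_pairs):
    """Classify a segment's graph of regular pairwise correspondences."""
    langs = set(languages)
    edges = [tuple(p) for p in regular_pairs]
    edge_set = {frozenset(p) for p in edges}
    if all(frozenset(p) in edge_set for p in combinations(langs, 2)):
        return "complete"
    if not is_connected(langs, edges):
        return "not connected"
    # Biconnected: still connected after removing any single language.
    if len(langs) > 2 and all(is_connected(langs - {l}, edges) for l in langs):
        return "biconnected"
    return "connected"

# The double triangle: everything regular except Swedish–Erzya.
print(grade(["Sw", "Fi", "Ka", "Er"],
            [("Sw", "Fi"), ("Sw", "Ka"), ("Fi", "Ka"), ("Fi", "Er"), ("Ka", "Er")]))
# The 'urine' case: Finnish and Hungarian linked only through Komi.
print(grade(["Fi", "Hu", "Ko"], [("Fi", "Ko"), ("Hu", "Ko")]))
```

The two calls give "biconnected" and "connected" respectively. The straggler configuration is thus graded higher than the 'urine' case, which is exactly why graph connectivity alone does not settle the question.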
Did you notice an assumption that I have snuck in unsaid above? It is that pairwise segment correspondences could be linked together into single distinct graphs of a segment’s correspondences. Actually though, this is not trivial, and already constitutes some basic work towards a reconstruction. I can define a few concepts related to this too, while I’m at it. It also gives a first dip into the topic of conditional sound changes and conditional sound correspondences.
- Given a set of word comparisons covering at least three languages, and with at least some word comparisons not covering all languages, a correspondence pattern is a grouping of pairwise sound correspondences that assigns a reflex for every language and assigns the pairwise sound correspondences of one multi-language word comparison into the same group.
- A correspondence pattern is fully attested if every language appears at least once within it.
- A correspondence pattern is complete if every one of its pairwise sound correspondences occurs in at least some word comparison.
- (I could again define also biconnected, connected etc. correspondence patterns as weaker options, but I am not sure if this is necessary.)
- A correspondence pattern is well-supported if there exists a word comparison that displays every member of the correspondence pattern. We could also call this “1-supported”, and define “n-supported” as the minimum number of word comparisons that displays every member.
- A pre-reconstruction is, in turn, a grouping of either binary sound correspondences (if between two languages) or correspondence patterns (if between more than two languages) by positional environments. This could be further split into a few subtypes too, e.g. conditioning by daughter-language phonetics or conditioning by proto-language phonetics. Already a single correspondence pattern, though, could also constitute a (fairly trivial) pre-reconstruction. — It should probably be demanded that a pre-reconstruction only unites correspondence patterns that have some overlap in their reflexes, not arbitrarily different ones.
- An unlabeled reconstruction is a set of pre-reconstructions that covers every sound correspondence within a comparative corpus. (Two-language comparisons could be always trivially considered to be unlabeled reconstructions.)
- An unlabeled reconstruction is fully reflected if every pre-reconstruction contains a fully attested correspondence pattern. (In the case of highly split correspondences, we might not want to demand this of every minor correspondence pattern.)
- Likewise, an unlabeled reconstruction is well-supported if every pre-reconstruction contains a well-supported correspondence pattern; etc.
Note that while an unlabeled reconstruction covers the entire system of correspondences in a given corpus of word comparisons, pre-reconstructions are segmentwise, per one alignment “slot” at a time (and this could maybe use a better term; “unlabeled segment reconstruction” doesn’t strike me as progress though). Also, as we already established in the previous section, correspondence patterns cannot be simply classified as “regular” or “not regular”. They are “once-soundlawful” by definition, but not anything more.
Continuing to work with the example from above, Fi. v ~ NS v and Fi. v ~ Er. в are sound correspondences; Fi. v ~ NS v ~ Er. в is a correspondence pattern that combines them, and moreover suggests also the existence of a third sound correspondence, NS v ~ Er. в. Once we observe that this correspondence pattern occurs exclusively word-initially, it can be combined with also a corresponding word-medial correspondence pattern (Fi. v ~ NS vv ~ Er. в) into a pre-reconstruction: Fi. v ~ NS v-/-vv- ~ Er. в.
There would be other options, e.g. to combine the word-initial pattern with a different medial correspondence pattern that it also overlaps with: Fi. v ~ NS kŋ ~ Er. в. Note that we usually choose the first option (and say that they reflect Proto-Uralic *w, while the second reflects PU *ŋ) primarily due to the greater phonetic similarity. It would be entirely possible to shift them around in the reconstruction, to claim that PU *w nasalises to *ŋ in Samic (etc.), but that there was also a segment *hʷ that occurs only medially and always lenites to *w or similar. This is in all respects exactly as regular as the usual reconstruction with *w and *ŋ; it only does worse in terms of how natural the required sound changes are.
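That the formal criteria do not choose between the two options can be shown directly. This is a minimal sketch in which a correspondence pattern is just a positional label plus one reflex per language. `can_unite` is a hypothetical name for the overlap requirement suggested earlier for pre-reconstructions, combined with the demand for complementary environments.

```python
# A correspondence pattern: a positional environment plus one reflex per language.
INITIAL   = ("initial", {"Fi": "v", "NS": "v",  "Er": "в"})
MEDIAL_VV = ("medial",  {"Fi": "v", "NS": "vv", "Er": "в"})
MEDIAL_KN = ("medial",  {"Fi": "v", "NS": "kŋ", "Er": "в"})

def overlap(p, q):
    """Languages in which two patterns show the same reflex."""
    (_, a), (_, b) = p, q
    return {lang for lang in a.keys() & b.keys() if a[lang] == b[lang]}

def can_unite(p, q):
    """Complementary environments and at least one shared reflex."""
    return p[0] != q[0] and bool(overlap(p, q))

for candidate in (MEDIAL_VV, MEDIAL_KN):
    print(candidate[1]["NS"], can_unite(INITIAL, candidate), sorted(overlap(INITIAL, candidate)))

# The two medial patterns share an environment, so they cannot be the same segment.
print(can_unite(MEDIAL_VV, MEDIAL_KN))
```

Both medial patterns overlap with the initial one in Finnish and Erzya, so either pairing passes the check. Picking between them takes something beyond regularity, such as the phonetic naturalness of the implied changes.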
It’s also possible to advance an objection against the demand for sound correspondences to overlap before they can be combined in the same pre-reconstruction, even if it is clear that often combining non-overlapping sound correspondences would create nonsensical pre-reconstructions. Suppose a proto-language had some prominent allophonic distribution, e.g. between word-initial voiceless *[t] and word-medial voiced *[d]; but, independently in all descendants (e.g. perhaps due to later sound changes such as *st > /t/ or *ð > /d/), /t/ and /d/ have become different phonemes. Then, even if we take phonological and not phonetic data as our input, word-initial t ~ t ~ t and word-medial d ~ d ~ d will end up being two different correspondence patterns with no overlap between them.
Is this a problem? Not necessarily. It seems to me that unifying these as the same proto-phoneme is not a task of reconstruction: it is a task of the phonological analysis of the proto-language. That is, we see that reconstruction outputs “allophonemes”, not phonemes, possibly even in the case where the input is phonemic data. Due to this, it can be often a good idea to also not use phonemic but rather similarly “allophonemic” input data. Suppose now a case where medial voicing of stops has remained purely allophonic in all descendants of a proto-language: in this case, the allophony rule could be still reconstructed, but only if we do not first eliminate it from the data by collapsing [t-] and [-d-] into /t/.
(Despite these examples being simplistic, it is also the case that identifying nonphonological contrasts in a reconstruction is often not trivial. One of the more surprising adjustments to Uralic historical phonology over the last few decades has after all been the result that, while traditionally reconstructed *oo and *ee do seem to contrast with *o and *e, they do not contrast with *a and *ä, and in fact have massive overlap with them in their reflexes, even if the proposed phonological proto-values are quite different. The solution to this has also not been to set up [a ~ oo] and [ä ~ ee] in an original allophonic relationship, it has been to recognize *oo and *ee as later innovations exclusive to Finnic.)
Lastly, a few statistical measures of pre-reconstructions that I can think of, which might come in useful eventually.
- The multiplicity of the pre-reconstruction is the number of correspondence patterns it encompasses.
- The split count = S of the pre-reconstruction is the number of phonological splits it tracks. If a language shows N different reflexes across a pre-reconstruction, the split count of this language for this segment is N-1; the total split count is then the sum of these across the dataset.
- The expected multiplicity is 2^S. The real multiplicity can be often smaller, though, both due to similar conditioning in several languages, and due to gaps in the data where two conditioning factors by accident do not occur in any word (even if this would be theoretically possible). Some general positional considerations could be applied to calculate a better expected value.
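These three measures are easy to compute once a pre-reconstruction is written down as a list of correspondence patterns. The sketch below uses the Fi. v ~ NS v-/-vv- ~ Er. в pre-reconstruction from earlier and implements the definitions exactly as stated, including the 2^S estimate. The data layout is an assumption made for illustration.

```python
# A pre-reconstruction as a list of correspondence patterns (language -> reflex).
# Example from the text: Fi. v ~ NS v-/-vv- ~ Er. в
PRE_RECONSTRUCTION = [
    {"Fi": "v", "NS": "v",  "Er": "в"},   # word-initial pattern
    {"Fi": "v", "NS": "vv", "Er": "в"},   # word-medial pattern
]

def multiplicity(patterns):
    """Number of correspondence patterns the pre-reconstruction encompasses."""
    return len(patterns)

def split_count(patterns):
    """Sum over languages of (number of distinct reflexes - 1)."""
    langs = {lang for p in patterns for lang in p}
    return sum(len({p[lang] for p in patterns if lang in p}) - 1 for lang in langs)

def expected_multiplicity(patterns):
    """The 2^S estimate, treating every split as binary and independent."""
    return 2 ** split_count(patterns)

print(multiplicity(PRE_RECONSTRUCTION),
      split_count(PRE_RECONSTRUCTION),
      expected_multiplicity(PRE_RECONSTRUCTION))
```

Here only Northern Sami shows a split, so S = 1 and the real multiplicity of 2 matches the expected one.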
Continuing on. Before we start adding too many near-regularities and irregularities on top of phonological foundations, it is worthwhile to consider how far we might be able to get with just them.
A naive guess could be that the best phonological foundation for some large family like Uralic consists simply of gathering as many phonological cores as possible, taking their union, and topping up with any regular adduct comparisons that fit into this system. I think this is probably a bad idea, though. I’ve bordered above on the problem that it is often possible to identify phonological cores that consist of loanwords. These can be not just loanwords to/from an outside source; they can be also inside a family, creating false correspondences. There might be also some small number of accidental cores out there, even. E.g. nursery words of the mama papa dada type will easily allow establishing a regular correspondence a ~ a between almost any languages in the world, and it would only take a few coincidences to end up being able to show their consonant correspondences to be regular too.
As established, one way to weed these out will be examining the big picture of the sound correspondences and demanding biconnected etc. regularity (essentially an argument from distribution). Another clear source of false positives though is that so far I have not been very strict in defining “regularity” to begin with: I’ve accepted mere recurrence of any kind as sufficient. Normally, two examples of a sound correspondence are actually only very feeble evidence!
My assumptions, previously unspoken, have been the following:
- If a linguistic relationship is real, then most sound correspondences will recur, over and over, within and between different cores, and build up naturally in this way once we start considering larger foundations.
- Sound correspondences come in an exponentially decaying longish-tail distribution: while some will end up recurring quite abundantly, most don’t.
The second is particularly because of conditional splits, which will divide any proto-segment across multiple correspondence patterns. Between all three of Finnish, Northern Sami and Erzya, there are some 40–50 examples known of the word-initial sound correspondence k ~ g ~ к, some 20–30 for the next most abundant examples like word-initial p ~ b ~ п (and it is not coincidental that neither of these consonants has been affected by any further conditional sound changes in any of the three languages); but for the most poorly attested regular correspondences, we indeed have to make do with just two examples between just two languages, before fading into correspondences that are regular only when routed through some additional language, or regular soundlawfully but not by binary comparison, or only semi-regular, or irregular entirely.
Could we just require every pairwise sound correspondence to occur at least thrice, and then work with “3-cores” and “3-foundations” as the most reliable key evidence? This is probably possible between some closely related languages. I am however uncertain if there would exist any of these for wider Uralic at all. There definitely are not any neat and compact nine-item cores that look like *pala *pola *pula | *tala *tola *tula | *kala *kola *kula (analogous in structure to my four-item cores covered above). This is for two reasons: (1) given the “long-tailedness” of pairwise sound correspondences, it is unlikely to find many high-frequency correspondences co-occurring in a word comparison; (2) in Uralic in particular, word roots/stems are relatively long, 4–5 segments, which makes it even harder to find a word comparison that avoids all the rare-but-regular sound correspondences.
Maybe some other condition needs to be relaxed at the same time? E.g. counting things on the pre-reconstruction level instead. After we’ve identified a complementary distribution e.g. among the different Samic cognates of Finnish /v/, we could then recognize Fi. v ~ NS v- and Fi. v ~ NS -vv- as the same meta-correspondence, and so forth… But this actually already pares things back to the level of mere soundlawful regularity: all soundlaws affecting some proto-segment are already encoded within the correspondence bundles of a pre-reconstruction, and only a phonetic label for the proto-segment is missing. And demanding more than two examples of a reflex is not too hard at all.
A better option is perhaps to instead use the fact that words are inherited as a whole. If a word comparison shows three highly recurring correspondence patterns and one more poorly attested but still regular one, the three first should also allow us to put more trust in the fourth not being accidental. We could even calculate the average regularity. To avoid high-frequency correspondences “covering for” too many low-frequency ones, though, this should also probably be the geometric mean, not the usual arithmetic mean.
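The difference between the two means is easy to see on a toy case. This is a sketch with hypothetical attestation counts for a four-segment word comparison (three well-attested correspondences and one attested only twice). The function name `wordwise_regularity` is illustrative.

```python
from statistics import geometric_mean, mean

def wordwise_regularity(correspondence_counts):
    """Geometric mean of the attestation counts of a comparison's correspondences."""
    return geometric_mean(correspondence_counts)

# Hypothetical counts: three abundant correspondences and one attested twice.
counts = [40, 30, 25, 2]

print(round(mean(counts), 2))                  # arithmetic mean, dominated by the big counts
print(round(wordwise_regularity(counts), 2))   # geometric mean, pulled down by the rare one
```

The geometric mean stays well below the arithmetic one as soon as a single correspondence is rare, so a few frequent correspondences cannot fully cover for a weak one. A count of 1 (a nonprovable correspondence) still gives a defined value.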
It’s even possible to propose that wordwise average regularity (let’s abbreviate this to WWAR) should to some extent trump segmentwise regularity altogether. Consider again some case like Fi. kusi and Hu. húgy that is not perfectly provably regular. That we still “want to” relate them can be after all motivated also without reference to the other Finno-Ugric languages, or to any detailed semantic considerations, by how k- ~ h- is a highly regular correspondence. So is -i ~ ∅, though this is a bit too “morphological” to fully count. u ~ ú is also attestable, if rarer and without well-known conditioning factors.
Besides giving a natural way to incorporate nonprovable and exceptional correspondences into an “extended phonological foundation”, WWAR is a measure that has also a few further good features. For one it largely captures the fact that short CV and CVC comparisons are more vulnerable to chance resemblances. For two, inversely it allows putting a bit more trust in word comparisons involving consonant clusters, which often show some highly conditional sound changes ⇒ not highly regular correspondences. In a comparison like Fi. täysi : täyte- ‘full’ ~ Hu. tel- ‘to be full’ we can then rely on as many as four highly or at least reasonably regular correspondences (t ~ t, ä ~ e, s : t ~ l, i : e ~ ∅) and not have to worry about y ~ ∅ too much.
But I think that it is still also necessary to start with solidly regular foundations, since the frequency of a sound correspondence depends on the corpus of word comparisons. Adding kusi ~ húgy to a corpus of Finnish–Hungarian word comparisons, from the sound correspondence point of view, does not only add the one case of s ~ gy, it increases by one also the counts of the other three correspondences, i.e. makes them more regular still. This being the case, there could be a risk of “farming” some highly recurring correspondences from numerous exceptional or nonprovable word comparisons, and using these as the main workhorse carrying the reconstruction. This was one problem in Uralistics in the 19th and early 20th century: a good understanding of some of the stronger correspondences had been worked out (in particular among consonants), which were relied on to accept also all kinds of poorer correspondences (in particular among vowels). Similar examples occur elsewhere in the history of etymology too, I’m sure (insert cliché allegedly-Voltaire quote here).
Given a corpus of word comparisons that is known to include some crud, there should actually exist a sweet spot of a sort. Calculate the WWAR and also the average WWAR across the corpus; then prune the lowest-WWAR comparison(s) and see what happens to the average WWAR. Eliminating highly irregular crud should raise this metric. But also pruning everything down to just a single phonological core would leave the average WWAR at no more than 2. Somewhere, then, there will be a maximum average WWAR that will be in a sense the most regular sub-corpus that can be achieved. There can be multiple local maxima though (there definitely are “between” cores, again as per my above example with vala ~ вал), and I’d have to work through a larger example corpus in detail to see.
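The pruning procedure can be sketched as a greedy loop. The assumptions here: a binary Finnish–Erzya corpus, WWAR taken as the geometric mean of within-corpus correspondence counts, and one invented placeholder comparison ("CRUD", not real data) whose correspondences are all unique. Being greedy, the loop can only find one of the possibly several local maxima.

```python
from collections import Counter
from statistics import geometric_mean, mean

CORPUS = {
    "kesä ~ kize": [("k", "k"), ("e", "i"), ("s", "z"), ("ä", "e")],
    "pesä ~ pize": [("p", "p"), ("e", "i"), ("s", "z"), ("ä", "e")],
    "kala ~ kal":  [("k", "k"), ("a", "a"), ("l", "l"), ("a", "")],
    "pala ~ pal":  [("p", "p"), ("a", "a"), ("l", "l"), ("a", "")],
    # Invented placeholder, not real data: every correspondence is unique.
    "CRUD": [("x1", "y1"), ("x2", "y2"), ("x3", "y3")],
}

def wwars(corpus):
    """WWAR of each comparison, with counts taken from the corpus itself."""
    counts = Counter(c for corrs in corpus.values() for c in corrs)
    return {word: geometric_mean([counts[c] for c in corrs])
            for word, corrs in corpus.items()}

def prune_to_best(corpus):
    """Greedily drop the lowest-WWAR comparison; return the sub-corpus with the highest average."""
    current = dict(corpus)
    best, best_avg = dict(current), mean(wwars(current).values())
    while len(current) > 1:
        scores = wwars(current)
        del current[min(scores, key=scores.get)]
        avg = mean(wwars(current).values())
        if avg > best_avg:
            best, best_avg = dict(current), avg
    return best, best_avg

best, best_avg = prune_to_best(CORPUS)
print(sorted(best), round(best_avg, 3))
```

On this toy corpus the average first rises when the placeholder is removed and then falls as soon as the core itself is cut into, so the maximum sits at the four-item core with an average of 2.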
Defining WWAR for non-binary comparisons will be something to figure out later also. Would just covering all the pairwise correspondences work? Perhaps it does. E.g. we can note that the number of pairwise correspondences grows quadratically as the number of independent members in an etymological comparison increases, and so this metric would naturally capture the intuitive impression that widespread etymologies are stronger (increase average WWAR more) than narrowly spread ones are.
I can think of one further potential problem in approaching reconstruction primarily as collecting phonological cores. A particular etymology could be quite regular between a handful of languages, but not between others. Maybe some further cognate shows unexplained quirks, or in some language group there exists a proposed but very dubious maybe-cognate. This is very common across Uralic, probably in any deeper and wider language family really. How worried should we be if these cognates turn out to not fit into phonological foundations?
For a demonstration of the issue, a few examples from what I generally consider AAA-class Uralic vocabulary overall:
- *ëla- ‘under’: Mansi shows *jal- instead of expected **ëël-.
- *elä- ‘to live’: Mordvinic has for this meaning *eŕa-, irregular on every segment but phonetically fairly close regardless.
- *enä ‘big’: Komi has /una/ ‘many’ instead of expected ˣ/on-/. Udmurt /una/ is in principle regular, but per Komi this may have been irregular *una and not **ona already in Proto-Permic.
- *ďëmə ‘bird cherry’: Erzya shows /lʲom/ with unexpected /o/ and unexpected initial palatalization, Moksha shows /lajmä/ with unexpected intrusive -j-, and even a common Proto-Mordvinic form does not seem to be readily reconstructible.
- *ipsə ‘smell’: Hungarian has íz with irregular /z/ (maybe nonprovable as a reflex of *ps in particular). Moksha has /opəś/, irregular on every segment except *p.
- *jäŋə ‘ice’: Permic has *jë, with unexplained loss of *ŋ (which has parallels though, so technically regular) and an irregular vowel.
- *jëxə- ‘to drink’: the labial vowel in Samic *jukë-, Finnic *joo- is not really expected and has no exact parallels.
- *kajwa- ‘to dig’: Samic has *koajvō- instead of expected **kuojvē- or **kuojvō-.
- *kälä- ‘to wade’: Mansi *kʷääl- has an unexpected labialized initial, Khanty *küüL- unexpected height and labialization of the vowel, instead of expected **kääl- and **kööL- (or **käL-).
- *kätə ‘hand’: Mari has *kit instead of expected **ket.
- *kiwə ‘stone’: Udmurt has /kɤ/ instead of expected /ki/ (which does occur in Komi).
- *kulkə- ‘to go’: Hungarian has halad instead of expected ˣhol- or similar.
(Incidentally it is noteworthy that while there are some consonantal problems too, all of these cases show some vocalic problems.)
Regardless, all of these etymologies show perfectly soundlawful, regular reflexes in at least six other languages. At least the comparison of these is beyond any reasonable doubt. For the exceptional cases, however, it is conceivable that some of them have in fact been adduced erroneously and should be treated as e.g. family-internal loanwords or as unrelated.
There is also a smaller group still of completely and unambiguously clean widespread Uralic etymologies, including e.g. the above-considered *kala ‘fish’ and *pala ‘bit’. Should we perhaps prioritize these cases somehow when building up a phonological foundation? Maybe not. *kala and *pala both happen to lack known reflexes in Permic… If I were to propose some outrageously irregular reflexes from there, does this in any way weaken the other pairwise comparisons? The same really holds for more promising irregular reflexes too. As a reminder, the main point of the framework I am sketching in this post is to assess if a proposed reconstruction or system of correspondences is acceptable, or if it is better than some other proposal. That there remains more work to do is a different issue.
At other times still, a proposed etymology could have deeper fault lines, such as being more regularly analyzable as two etymologies, possibly with a bridging member. Such cases can also be found across Uralic, e.g. when western languages point to *kakta but eastern languages to *kettä as the proto-form of the numeral ‘2’. It is not clear to me what to do in such cases. They can still e.g. demonstrate branch-specific sound changes as long as we keep a leash on which pairs of languages are compared.
In any case, much like widespread sound correspondences, widespread etymologies are not all-or-nothing cases. They may be more regular between some languages, less regular between others. Single outliers or multiple equally distant ones will be easy to identify and possibly exclude at least. It is surely a problem if a proposed language family starts having substantial amounts of etymologies which only really work between a few languages and not any others, but this might well be a problem of etymological work and not of the relationship itself. It is hard to think of any formal justification for treating an irregular etymology as “too good to be rejected”. Substantial and intractable irregularity is a good reason to decide that a proposed cognate is just wishful thinking built on superficial similarity, or at the very least too weak to build a foundation on, no matter how long e.g. its pedigree in etymological literature is. The best illustrations for this principle surely come from cases where a different, more regular etymology turns out to be possible after all. The classic of the genre is the superficial resemblance of Latin deus and Greek θεός. From Uralic, consider e.g. Livonian sūoŗ ‘vein, sinew’: by current thinking this is not a reflex of PU *sënə ‘id.’ (> Proto-Finnic *sooni, reflected in all the rest of Finnic) with irregular *n > *r (> ŗ /rʲ/); it is instead a perfectly regular reflex of a distinct but partly synonymous PU root *särä. As an older example I could mention the comparison of Fi. aivot ‘brains’ with Northern Sami oaivvi ‘head’, which appears in some 19th-century works before being replaced by the current comparisons: Fi. aivot ~ NS vuoigŋašak ‘id.’, Fi. oiva ‘proper’ ~ NS oaivvi (both fully regular even if less immediately apparent). Similar examples could be collected at least by the dozens from etymological literature.
There is one failure mode of overcriticality around this area though; it is one where difficulties in reconstruction are confused with irregularity. E.g. as I’ve pointed out before, in Khanty the development of *kala, *pala differs from a third rhyme word *sala- ‘to steal’. But *ɬaaL- as the reflex of the third is not irregular in any sense I’ve defined so far! There are several cases of *a-a > *aa, hence also several cases of correspondences like Finnic *a ~ Khanty *aa or Samoyedic *å ~ Khanty *aa. The only problem is in our lack of understanding of the conditioning factors that lead to a double representation *uu ~ *aa. It would be possible to e.g. propose an entirely regular reconstruction of PU with two open back vowels, *a and *å, distinguished only in Khanty.
Where from here
Whatever the exact route, it would be a long and at many points tedious exercise to work up from small phonological cores all the way to our current understanding of Uralic etymology and comparative phonology. This would regardless be illustrative, I think. If we repeated the process with a few other language families too, we might eventually be able to set up an objective metric for how phonologically (ir)regular some known or proposed language relationship really is. Also, just the largest achievable phonological foundation is probably not a good metric. My suspicion is that allowing any and all minimally regular correspondences, without constraints on their number, will lead to a vast ballooning of the system of correspondences that can take just about anything we throw at it (parallel loanwords, parallel derivatives, onomatopoeia…), and something like WWAR will be a much better metric of regularity.
There will be further technical issues to work out too, such as the effects of subgrouping and intermediate reconstructions (which could be used to define something like “phonological subgroupiness” also); the methods we use for identifying conditional sound correspondences; or adding typological constraints for the segment inventory of the proto-language or the sound correspondences we will tolerate (a correspondence like *n ~ *n should surely require less evidence to be acceptable than a correspondence like *m ~ *k).
 In language families such as Uralic, which I call “trochaic” or “left-rooted” (I should probably expand on this concept later on as well), alignment is really largely trivial: initial consonants or zero initials always correspond, first-syllable vowels always correspond, medial consonants and respective components of clusters always correspond, stem vowels in languages with bisyllabic roots always correspond. Complications start to arise only in corner cases like metathesis, initial-vowel syncope, or derivational suffixes added to CVC stems. Diphthongs and long vowels could provide some problems too, but then contractions like *ej > ii can be always also rewritten as conditional correspondences along the lines e ~ ii and j ~ ∅.
 Then again, what we are currently accomplishing with AI in fields other than linguistics suggests to me that automated linguistic reconstruction cannot be done right on the first try in any case. Any reasonably feasible algorithm most likely has to be based on generating a first pass and then iterating improvements to it. If we are good enough at the latter, it’s OK if the former is still fairly bad. This is how real reconstruction also works, after all.
 In particular there are no phonological cores built out of comparisons covering only two languages but with the data altogether covering more than two languages: every two-language pair could be separated as its own core instead.
 What’s off is that the different treatment of the final vowels in Erzya is actually not due to any original difference in their strength, it is due to a recent and weirdly specific innovation syncopating *ə after *Cal. Unsyncopated forms have still been attested in Witsen’s 17th century records of Moksha.
 The absolute minimum is a comparison of two items with two segments that are the same in both, e.g. al, la in language 1 ~ er, re in language 2, or indeed, a comparison of two pairs of homophones.
 All single medial consonants are geminated in most of Sami in strong-grade positions (hence with sound correspondences distinguishable from initial consonants), all final vowels are lenited in languages like Mansi (hence distinguishable from initial-syllable vowels), all original consonant clusters are simplified or broken apart in Hungarian, etc.
 “Biconnected” can be taken to mean that between any two vertices, there are at least two mutually disjoint paths, or that the removal of any one vertex will not break the graph into two or more non-connected components. Upgrading these definitions to “three paths” or “two vertices” may not yield quite the same meaning for putative “triconnected” (clearly the former is stronger than the latter though).
 After all it has been proposed that in Finnish e-stem words, at least ones like this that have consonant-stem partitive singulars (kusta), only √kus- is really a part of the stem and -i : -e- is a prop vowel. Or even, -e- at least: another possibility, probably more provocative, would be to claim that -i is a nominative singular ending.
 Traditionally known proposals include pura ‘drill’ ~ fúr ‘to bore’, suippu ‘point’ ~ csúp ‘point’, survoa ‘to mash’ ~ szúr ‘to pierce’.
 E.g. relaxing semantics a bit, Mansi *jal could be compared also with *jalka ‘foot’ or #jülŋä ‘tree stump’ (though these only help with the *j-). For Hungarian íz (dialectally also éz), Finnish & Karelian eto- ‘to find disgusting’ seems like a promising direction of comparison.
 And perhaps they should be. Etymologies are most of the time cited in secondary and tertiary literature without the scaffolding of historical phonology that holds them up in the first place. I suspect this often leads to beginners and non-historical linguists getting the false impression (or maybe rather, strengthening the natural folk-etymological impulse of thinking) that just similarity is good enough for setting up an etymology.