I’ve sometimes remarked (but until now, not on this blog) that one interesting difference between Uralic and Indo-European studies is their radically different approaches to lexical reconstruction. Uralic studies have long held on to the idea of a deeply stratified family tree, and accordingly, word roots dating to the same, nearly identical stage of phonological reconstruction have been variously separated as “Proto-Finno-Samic”, “Proto-Finno-Volgaic”, “Proto-Finno-Permic”, “Proto-Ugric”, “Proto-Finno-Ugric” or “Proto-Uralic”, depending simply on which branches of Uralic their descendants have survived in. On the IE side, by contrast, all available reconstructions are generally treated under the title “Proto-Indo-European”, no matter if we’re dealing with a word root with a narrow distribution covering only e.g. Germanic and Balto-Slavic, or one found everywhere from Irish to Bengali and from Hittite to Tocharian. (Fairly often, quite different reconstruction stages are also equated, at least in name; mostly in connection with laryngeal theory, which I find to be in mostly poor shape when it comes to distinguishing between comparative and internal reconstruction.)
Ironically enough, both sides appear to have been wrong. The evidence for most of the traditional intermediate groupings of Uralic has either evaporated long since, or has turned out to have been illusory all along; while studies on the dialectification of Indo-European fairly consistently keep suggesting the status of Anatolian and possibly Tocharian as early splits.
Focusing more on the IE side for once: there do not yet seem to be general-purpose sources that examine how many of the numerous typological and allegedly synchronic analyses of Proto-Indo-European would hold even if we restricted our view to just the oldest material. (There are individual papers out there somewhere, I’m sure, but admittedly I have not been looking especially hard for them.) But in order to get some kind of a rough idea, I’ve started a small project: taking Wiktionary’s list of Proto-Indo-European roots as a starting point and indexing them according to their distribution across the better-documented IE languages (i.e. no Phrygians or Messapics). You can check on the work in progress over here. Sure enough, while convenient, this is probably also a fairly unsystematic sample of data. I might want to follow up on this at some point by taking a look at some more comprehensive modern root lists, such as the LIV.
This anyway comes out as a type of dataset I have some practice with by now: a distribution matrix, recording the absence or presence of a root in a subgroup. There are some interesting things you can do with such data, although I think a generally applicable theory remains undeveloped. I already have several similar projects involving Uralic data in preparation. Of these, the two in the best shape are a spreadsheet database of the common Samoyedic lexicon (about 780 entries, mostly from Janhunen’s Samojedischer Wortschatz; currently missing little beyond the translation of the remaining German glosses into English), and another one listing the best-preserved common Uralic lexicon (with reflexes in six or more of the nine main Uralic subgroups, which comes out at about 200 entries; currently missing little beyond the addition of the remaining intermediate Proto-Samic/Proto-Finnic/Proto-Samoyedic forms).
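As a minimal sketch of what such a distribution matrix looks like in practice, the Python below stores each root as a row of 0/1 flags per subgroup and filters by breadth of attestation, along the lines of the “six or more of nine” cutoff. The subgroup labels follow the conventional nine-way division of Uralic, which is assumed here, and the root names are placeholders rather than real etyma.

```python
# Assumed labels for the nine main Uralic subgroups (conventional division).
SUBGROUPS = ["Samic", "Finnic", "Mordvinic", "Mari", "Permic",
             "Hungarian", "Mansi", "Khanty", "Samoyedic"]

# Placeholder roots, not real etymologies: each maps to the set of
# subgroups in which a reflex is attested.
ATTESTED = {
    "root_A": set(SUBGROUPS),
    "root_B": {"Samic", "Finnic", "Mordvinic", "Permic", "Mansi", "Khanty"},
    "root_C": {"Samic", "Finnic"},
}

def matrix_row(present):
    """Return the 0/1 presence vector of one root, in SUBGROUPS order."""
    return [1 if group in present else 0 for group in SUBGROUPS]

def widely_attested(attested, threshold=6):
    """Roots with reflexes in at least `threshold` subgroups."""
    return [root for root, present in attested.items()
            if sum(matrix_row(present)) >= threshold]

print(matrix_row(ATTESTED["root_C"]))   # [1, 1, 0, 0, 0, 0, 0, 0, 0]
print(widely_attested(ATTESTED))        # ['root_A', 'root_B']
```

The flat row sum used for the cutoff is exactly the “sum-of-branches” idea: every subgroup counts the same, whatever its size or age.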
With PIE and the Indo-Hittite question, one followup could be similar filtering of the evidently abundant “Common IE” lexicon (= everything not attested from Anatolian and Tocharian). It’s after all probable that a lot of vocabulary that once occurred in Anatolian and/or Tocharian remains simply undocumented in the literary records of the languages; and, other things being equal, a word root attested widely across the modern IE languages is more likely to be an archaism (or an erroneous comparison) than one reconstructed on the basis of more fragmented data.
But at this point I run into the question: what kind of a metric should I use for assessing how well a given proto-root has been retained? A flat sum-of-branches function seems to still work decently for Uralic, but for IE, not so much. The fundamentally underdocumented Anatolian and Tocharian are one type of problem, while another is the “family-isolates” Albanian and Armenian, where an order of magnitude less inherited vocabulary is found than in the old major groups like Greek or Indo-Iranian. It seems clear that if a Common IE root is only lost from Alb.+Arm., this is not as big a deal as if it were instead lost from Gr.+II. But how much so exactly? And suppose I were to treat II reflexation as worth e.g. one point, but Albanian reflexation as worth one half: should I then also treat e.g. Slavic reflexation as worth something like 0.8, given that the group is also clearly younger (and has had more opportunities for renewal of vocabulary)?
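To see what is at stake, here is a rough Python sketch comparing a flat sum-of-branches score with a weighted one. Only the weights for Indo-Iranian, Albanian and Slavic come from the hypothetical figures above; the other weights, the short branch list and the function names are assumptions made up for the illustration.

```python
# Illustrative weights only. Indo-Iranian = 1, Albanian = 0.5 and
# Slavic = 0.8 are the hypothetical values floated in the text; the
# rest are assumptions added so the example has something to sum.
WEIGHTS = {
    "Indo-Iranian": 1.0,
    "Greek": 1.0,      # assumed
    "Germanic": 1.0,   # assumed
    "Slavic": 0.8,
    "Albanian": 0.5,
    "Armenian": 0.5,   # assumed, by analogy with Albanian
}

def flat_score(present):
    """Count the branches with a reflex, each worth one point."""
    return sum(1 for branch in WEIGHTS if branch in present)

def weighted_score(present):
    """Sum the per-branch weights of the branches with a reflex."""
    return sum(w for branch, w in WEIGHTS.items() if branch in present)

all_branches = set(WEIGHTS)
lost_alb_arm = all_branches - {"Albanian", "Armenian"}
lost_gr_ii = all_branches - {"Greek", "Indo-Iranian"}

# The flat score cannot tell the two cases apart: both come out as 4.
print(flat_score(lost_alb_arm), flat_score(lost_gr_ii))
# The weighted score can: roughly 3.8 versus roughly 2.8.
print(round(weighted_score(lost_alb_arm), 2), round(weighted_score(lost_gr_ii), 2))
```

The mechanics are trivial; the open question remains where the weights should come from.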
Initially it may seem that just noting the overall rate of lexical retention should work. Let’s say Albanian has lost 70% of the Common IE lexicon, while Germanic has lost 10%; does this mean that loss in Albanian is therefore seven times less valuable as evidence?
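The arithmetic behind this naive reading can be spelled out in a few lines of Python. The inverse-loss-rate weighting below is just one assumed way of formalizing “seven times less valuable”, not an established metric, and the function name is made up for illustration.

```python
# Hypothetical loss rates from the paragraph above (not measured values).
LOSS_RATE = {"Albanian": 0.70, "Germanic": 0.10}

def absence_weight(branch):
    """Naive evidential weight of a missing reflex: inverse of the loss rate."""
    return 1.0 / LOSS_RATE[branch]

# How much more a Germanic absence would count than an Albanian one.
ratio = absence_weight("Germanic") / absence_weight("Albanian")
print(round(ratio, 2))  # about 7
```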
This approach, however, would seem to conflate lexical archaicity and lexical diversity. Even if, say, Germanic and Indo-Iranian are both subfamilies that retain 90% of the Common IE vocabulary, this does not imply that their histories have been essentially identical. As far as we know from history and archeology, this “symmetry” would be due to the former having long remained on the margins of Northwest Europe, without as many opportunities for renewing its lexicon; while the latter split into further subgroups early on, including several languages first attested soon afterwards, so that the odds are good that any given IE root could have been retained in at least a few descendants somewhere.
Another variable to take into account thus might be the amount of lexical diversity within a language group. But I have yet to work out how exactly to formulate a metric for this, either. And the question kind of iterates… determining the lexical diversity within e.g. Indo-Iranian is probably going to require a way to assess the lexical distribution between its main branches; and then likewise for determining the lexical diversity within e.g. the Persid languages; and then finally the same also for varieties of modern Persian. Ultimately this then reduces to a question of how well individual language varieties have been documented in the first place.
I might simply need a clearer theory of what I am trying to assess about etymological distributions to begin with. In principle, there seem to be at least two somewhat distinct issues involved:
- attempting to determine the “internal rate of loss/innovation” for a particular lexeme (which, contrary to even the more sophisticated lexicostatistic theories out there, is in all likelihood not a constant, but rather something further depending on a language’s sociolinguistic situation and other such external variables); and from this to approximate how much further back from its oldest strictly reconstructible stage it is likely to date
(e.g. if we can reconstruct the Common IE roots *kakka- ‘poop’ and *pléwmon- ‘lung’, we could perhaps assume from just the semantics, already before any sound-symbolic or similar considerations, that the former is younger than the latter)
- attempting to determine how likely it is that a particular widespread word root is actually a later areal innovation rather than common inheritance
(e.g. all other things being equal, a putative PIE root that has not been attested from any Celtic language is more likely to be a lexical innovation that never reached the westernmost Late PIE dialects, than one that does extend there; or, for that matter, a word root attested only from Latin, Greek and Anatolian carries a bigger risk of involving serial loaning than a word root attested only from Umbrian, Slavic and Anatolian)
Both of these approaches would provide evidence on how likely it is that e.g. some Common IE root was or wasn’t already present in Proto-Indo-Hittite. But they nonetheless involve distinct historical processes.
 Technically these should be considered probabilities, not boolean variables. If a reflex is uncertain or has unclear features, we can mark this uncertainty as a 0.8 or 0.5 or 0.1, instead of a plain 1 or 0. And even the zeros and ones should perhaps actually be considered shorthand for ε and 1-ε, for some minuscule ε approximating the probability that we’re in fact wrong about how the history of e.g. Greek works, or how historical linguistics and etymology work in general.
 Further information on these and my other similar projects is available on inquiry.
 Although it is interesting to note that, so far, almost all vocabulary with Anatolian parallels seems to be fairly well retained even in Alb. and Arm., in contrast to their poor retention of other vocabulary. Perhaps this indicates the greater resilience, and attestability, of core vocabulary compared to peripheral vocabulary? But already the “Indo-Tocharian” layer seems to fare worse. We’ll see if this pattern carries thru.
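The note above on treating matrix entries as probabilities rather than booleans could be sketched roughly as follows in Python. The value chosen for ε, the function names and the example row are all assumptions for illustration; summing the probabilities into an expected branch count is merely one natural generalization of the flat sum, not a worked-out theory.

```python
EPSILON = 1e-6  # assumed value; the note only asks for "some minuscule ε"

def as_probability(entry):
    """Clamp a matrix entry to [ε, 1 - ε], so 0 and 1 are never absolute."""
    return min(max(entry, EPSILON), 1.0 - EPSILON)

def expected_branch_count(row):
    """Expected number of branches with a genuine reflex, given one row."""
    return sum(as_probability(p) for p in row)

# Placeholder row: one secure reflex, three uncertain ones, one absence.
row = [1, 0.8, 0.5, 0.1, 0]
print(round(expected_branch_count(row), 3))  # 2.4
```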