Weighing etymological distributions

I’ve sometimes remarked (but until now, not on this blog) that one interesting difference between Uralic and Indo-European studies is their radically different approaches to lexical reconstruction. Uralic studies have long hung on to the idea of a deeply stratified family tree, and accordingly, word roots dating to the same, or a nearly identical, stage of phonological reconstruction have been variously separated as “Proto-Finno-Samic”, “Proto-Finno-Volgaic”, “Proto-Finno-Permic”, “Proto-Ugric”, “Proto-Finno-Ugric” or “Proto-Uralic”, depending simply on which branches of Uralic their descendants have survived in. On the IE side, by contrast, all available reconstructions are generally treated under the title “Proto-Indo-European”, no matter if we’re dealing with a word root with a narrow distribution covering only e.g. Germanic and Balto-Slavic, or one found everywhere from Irish to Bengali and from Hittite to Tocharian. (Fairly often also quite different reconstruction stages are equated, at least in name; mostly in connection to laryngeal theory, which I find to be in mostly poor shape when it comes to distinguishing between comparative and internal reconstruction.)

Ironically enough, both sides appear to have been wrong. The evidence for most of the traditional intermediate groupings of Uralic has either evaporated long since, or has turned out to have been illusory all along; while studies on the dialectification of Indo-European fairly consistently keep suggesting the status of Anatolian and possibly Tocharian as early splits.

Focusing more on the IE side for once: there do not yet seem to be general-purpose sources that would examine how many of the numerous typological and allegedly synchronic analyses of Proto-Indo-European would hold even if we restricted our view to just the oldest material. (There are individual papers out there somewhere I’m sure, but admittedly I have not been looking especially hard for them.) But in order to get some kind of a rough idea, I’ve started a small project: taking Wiktionary’s list of Proto-Indo-European roots as a starting point and indexing them according to their distribution across the better-documented IE languages (i.e. no Phrygians or Messapics). You can check on the work in progress over here. Sure enough, while convenient, this is probably also a fairly unsystematic sample of data. I might want to follow up on this by taking a look at some point at some more comprehensive modern root lists, such as the LIV.

This anyway comes out as a type of dataset I have some practice with by now: a distribution matrix, recording the absence or presence of a root in a subgroup. [1] There are some interesting things you can do with such data, although I think a generally applicable theory remains undeveloped. I already have several similar projects involving Uralic data in preparation. Of these, the two in the best shape are a spreadsheet database of the common Samoyedic lexicon (about 780 entries, mostly from Janhunen’s Samojedischer Wortschatz; what mainly remains is to finish translating the German glosses into English), and another one listing the best-preserved common Uralic lexicon (with reflexes in six or more of the nine main Uralic subgroups, which comes out at about 200 entries; what mainly remains is to finish adding the intermediate Proto-Samic/Proto-Finnic/Proto-Samoyedic forms). [2]
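To make the shape of such a dataset concrete, here is a minimal sketch of a distribution matrix in Python. The branch list, the root labels and every attestation in it are placeholders invented for illustration; none of it is taken from the actual spreadsheet.

```python
# Hypothetical branch labels; the real project may well use a different set.
BRANCHES = ["Anatolian", "Tocharian", "Indo-Iranian", "Greek",
            "Germanic", "Albanian", "Armenian"]

# Placeholder roots mapped to the branches where a reflex is recorded.
# These are made-up entries, not claims about any actual PIE root.
attested = {
    "root_A": {"Anatolian", "Indo-Iranian", "Greek", "Germanic"},
    "root_B": {"Indo-Iranian", "Greek", "Germanic", "Albanian", "Armenian"},
    "root_C": {"Germanic"},
}

def to_row(branches_with_reflex):
    """Turn a set of branches into a 0/1 row ordered like BRANCHES."""
    return [1 if branch in branches_with_reflex else 0 for branch in BRANCHES]

# The distribution matrix: one 0/1 row per root.
matrix = {root: to_row(found) for root, found in attested.items()}

def branch_count(root):
    """Flat sum-of-branches score for one root."""
    return sum(matrix[root])

def roots_in(branch):
    """All roots with a recorded reflex in the given branch."""
    column = BRANCHES.index(branch)
    return sorted(root for root, row in matrix.items() if row[column])
```

With this layout, filtering for e.g. an “Indo-Hittite” layer is just `roots_in("Anatolian")`, and the flat retention score of a root is `branch_count(root)`.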

With PIE and the Indo-Hittite question, one followup could be similar filtering of the evidently abundant “Common IE” lexicon (= everything not attested from Anatolian and Tocharian). It’s after all probable that a lot of vocabulary that once occurred in Anatolian and/or Tocharian remains simply undocumented in the literary records of the languages; and, other things being equal, a word root attested widely across the modern IE languages is more likely to be an archaism (or an erroneous comparison) than one reconstructed on the basis of more fragmented data.


But at this point I run into the question: what kind of a metric should I use for assessing how well a given proto-root has been retained? A flat sum-of-branches function seems to still work decently for Uralic, but for IE, not so much. The fundamentally underdocumented Anatolian and Tocharian are one type of problem, while another is the “family-isolates” Albanian and Armenian, where an order of magnitude less inherited vocabulary is found than in the old major groups like Greek or Indo-Iranian. [3] It seems clear that if a Common IE root is only lost from Alb.+Arm., this is not as big a deal as if it were instead lost from Gr.+II. But how much so exactly? And suppose I were to treat II reflexation as worth e.g. one point, but Albanian reflexation as worth one half: should I then also treat e.g. Slavic reflexation as worth something like 0.8, given that the group is also clearly younger (and has had more opportunities for renewal of vocabulary)?
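As a rough sketch of what a weighted alternative to the flat sum could look like: the values for Indo-Iranian, Albanian and Slavic below are the ones floated above, while the Armenian value and the default of 1.0 for every other branch are purely my assumptions for the sake of the example.

```python
# Branch weights. II = 1.0, Albanian = 0.5 and Slavic = 0.8 are the figures
# floated in the text; Armenian = 0.5 is an assumption (treated like Albanian).
WEIGHTS = {"Indo-Iranian": 1.0, "Albanian": 0.5, "Armenian": 0.5, "Slavic": 0.8}
DEFAULT_WEIGHT = 1.0  # assumption for any branch without an explicit weight

def flat_score(present):
    """Plain sum-of-branches: every branch counts the same."""
    return len(present)

def weighted_score(present, weights=WEIGHTS):
    """Sum of the weights of the branches where the root is attested."""
    return sum(weights.get(branch, DEFAULT_WEIGHT) for branch in present)

def weighted_loss(all_branches, present, weights=WEIGHTS):
    """Sum of the weights of the branches where the root is missing."""
    return sum(weights.get(branch, DEFAULT_WEIGHT)
               for branch in all_branches if branch not in present)

# Illustrative comparison of the two loss patterns discussed above.
branches = ["Indo-Iranian", "Greek", "Slavic", "Albanian", "Armenian"]
lost_from_alb_arm = {"Indo-Iranian", "Greek", "Slavic"}
lost_from_gr_ii = {"Slavic", "Albanian", "Armenian"}
```

Both example roots have the same flat score of 3, but `weighted_loss` counts the Gr.+II gap as twice as heavy as the Alb.+Arm. gap under these particular weights. The open question of what the weights ought to be is of course untouched by this.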

Initially it may seem that just noting the overall rate of lexical retention should work. Let’s say Albanian has lost 70% of the Common IE lexicon, while Germanic has lost 10%; does this mean that loss in Albanian is therefore seven times less valuable as evidence?

This approach however would seem to conflate lexical archaicity and lexical diversity. Even if, say, Germanic and Indo-Iranian are both subfamilies that retain 90% of the common IE vocabulary, this does not imply that their histories have been essentially identical. As far as we know from history and archeology, this “symmetry” would be due to the former having long been hanging out in the margins of Northwest Europe, without as many opportunities for renewing its lexicon; while the latter split into further subgroups early on, including several languages first attested soon afterwards, and so the odds are good that any given IE root could have been retained in at least a few descendants somewhere.

Another variable to take into account thus might be the amount of lexical diversity within a language group. But I also have yet to work out how to formulate a metric for this, exactly. And the question kind of iterates… determining the lexical diversity within e.g. Indo-Iranian is probably going to require a way to assess the lexical distribution between its main branches; and then likewise for determining the lexical diversity within e.g. the Persid languages; and then finally the same also for varieties of modern Persian. Ultimately this then reduces to a question of how well individual language varieties have been documented in the first place.
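One way to see how archaicity and diversity get conflated, and how the question iterates down the tree, is a toy model in which a root counts as retained in a group if it survives in at least one documented variety below it. The sketch assumes that losses in sister varieties are independent, which is certainly false in practice, and all the numbers in it are invented.

```python
from math import prod

def retained_somewhere(node):
    """Probability that a root survives in at least one variety under `node`.

    A node is either a number (per-variety retention probability) or a dict
    of child nodes. Losses in sister nodes are assumed to be independent,
    which is a simplification made only for this illustration.
    """
    if isinstance(node, (int, float)):
        return float(node)
    return 1.0 - prod(1.0 - retained_somewhere(child) for child in node.values())

# Invented figures: one conservative branch versus three innovative ones.
compact_group = {"only_branch": 0.9}
diverse_group = {"branch_1": 0.54, "branch_2": 0.54, "branch_3": 0.54}

# The same function recurses through subgroups down to single varieties.
nested_group = {"subgroup": {"variety_1": 0.5, "variety_2": 0.5}, "variety_3": 0.5}
```

Here `compact_group` and `diverse_group` both come out at roughly 90% group-level retention, although every single branch of the latter keeps only a little over half of the roots. A group-level retention rate alone thus cannot tell the two histories apart.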

I might simply need a clearer theory of what I am trying to assess about etymological distributions in the first place. In principle, there seem to be at least two somewhat distinct issues involved:

  • attempting to determine the “internal rate of loss/innovation” for a particular lexeme (which, contrary to even the more sophisticated lexicostatistic theories out there, is in all likelihood not a constant, but rather something further depending on a language’s sociolinguistic situation and other such external variables); and from this approximate how much further back from its oldest strictly reconstructible stage it is likely to date
    (e.g. if we can reconstruct the Common IE roots *kakka- ‘poop’ and *pléwmon- ‘lung’, we could perhaps assume from just the semantics, already before any sound-symbolic or similar considerations, that the former is younger than the latter)
  • attempting to determine how likely it is that a particular widespread word root is actually a later areal innovation rather than common inheritance
    (e.g. all other things being equal, a putative PIE root that has not been attested from any Celtic language is more likely to be a lexical innovation that never reached the westernmost Late PIE dialects, than one that does extend there; or, for that matter, a word root attested only from Latin, Greek and Anatolian carries a bigger risk of involving serial loaning than a word root attested only from Umbrian, Slavic and Anatolian)

Both of these approaches would provide evidence on how likely it is that e.g. some Common IE root was or wasn’t already present in Proto-Indo-Hittite. But they regardless involve distinct historical processes.

[1] Technically these should be considered probabilities, not boolean variables. If a reflex is uncertain or has unclear features, we can mark this uncertainty as a 0.8 or 0.5 or 0.1, instead of a plain 1 or 0. And even the zeros and ones should perhaps be actually considered to be shorthand for ε and 1-ε, for some minuscule ε approximating the probabilities that we’re in fact wrong about how the history of e.g. Greek works, or how historical linguistics and etymology works in general.
[2] Further information on these and my other similar projects available on inquiry.
[3] Although it is interesting to note that, so far, almost all vocabulary with Anatolian parallels seems to be fairly well-retained even in Alb. and Arm. compared with poor retention otherwise. Perhaps this indicates the greater resilience, and attestability, of core vocabulary compared to peripheral vocabulary? But already the “Indo-Tocharian” layer seems to fare worse. We’ll see if this pattern carries thru.
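
The suggestion in footnote [1], treating matrix entries as probabilities rather than booleans, could be sketched like this. The value of ε is an arbitrary placeholder, and the function names are made up for the example.

```python
EPSILON = 1e-6  # arbitrary stand-in for the "minuscule ε" of footnote [1]

def as_probability(value, eps=EPSILON):
    """Read a matrix entry as a probability, never exactly 0 or 1."""
    return min(max(float(value), eps), 1.0 - eps)

def expected_branch_count(row):
    """Expected number of branches with a genuine reflex, given one row."""
    return sum(as_probability(value) for value in row)

# A made-up row: one secure reflex, one doubtful one, one apparent absence.
example_row = [1, 0.5, 0]
```

A plain 1 is read as 1-ε and a plain 0 as ε, so the flat sum-of-branches score turns into an expected count that dubious comparisons contribute to only partially.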

Posted in Methodology
12 comments on “Weighing etymological distributions”
  1. I have grave doubts about at least five roots that are ascribed to Indo-Hittite in your table.
    1) *pénkʷe ‘five’. For Anatolian, Wiktionary lists Luwian paⁿta. I do not know the ultimate source of this form; at least H. Eichner in his chapter on Anatolian in the book “Indo-European Numerals” (ed. by J. Gvozdanović; Mouton de Gruyter, 1992) mentions nothing similar.
    2) *kʷékʷlos ‘wheel’. Wiktionary lists Hittite kugullas ‘donut’. The word is absent from A. Kloekhorst’s “Etymological Dictionary of the Hittite Inherited Lexicon”, which means that Kloekhorst does not consider it Indo-European in origin. J. Puhvel’s “Hittite Etymological Dictionary” lists this word as kuk(k)ul(l)a- (c.), a measure or quantity of semi-solids, sometimes with ‘copper’ determinative. Puhvel mentions that “[a]n edible NINDAku-gul-la-an (acc. sg) is found as an artonym”, and concludes that “[t]he semantic common denominator seems to be ‘lump, ball, loaf’”. He does not compare this word with *kʷékʷlos, however.
    3) For *kʷetwóres ‘four’ Wiktionary lists Lycian teteri. Again, it is not mentioned by H. Eichner in his treatment of Anatolian numerals. In fact, there is another etymology that derives Hittite kutruu̯an- ‘witness’ from *kʷetwóres (witness as a ‘fourth party’ along with plaintiff, defendant and judge). Against this etymology is the fact that Anatolian languages have another root for ‘four’: Hitt. me(i̯)u- / mei̯au̯-, CLuw. māuu̯a-.
    4) For *bher- ‘to bear / carry’, Wiktionary gives Hittite kapirt (without any gloss whatsoever). This word (kapart- / kapirt- (c.) a rodent) is discussed by Kloekhorst in EDHIL. The idea is that the word goes back to virtual IE *kom-bhḗr-t- / *kom-bhr-t- ‘one who carries together, hoarder, pack rat’. I can only agree with Kloekhorst that it is “too dangerous to assume that only the word for a rodent would display an inflection type that is so archaic that it is unattested elsewhere, or a phonetic development (“proclisis” of *kom- > ka-) that is not assuredly attested in other words”.
    5) For ‘brother’, we have Lydian brafrsis – a word whose meaning is only a guess – “Die Bedeutung bleibt dunkel”, according to Gusmani’s “Lydisches Wörterbuch”.
    In short, I would not use dubious comparisons for this kind of work. It is too easy to “find” an IE etymology (or two) for just about any word in any IE language, which makes the very idea of weighing etymological distributions fruitless – unless we work only with uncontroversial etymologies. Or perhaps we should work only with the words that have identical meanings, which would take us back to classical lexicostatistics.

    • j. says:

      I’m aware that the data is going to contain various inaccuracies. I had noted myself some of your concerns as well. Probably anything based on any single source would likely also contain at least a few similar cases. The hope/expectation, though, is that their number will not be as high as to skew the overall picture — and so, if there are some actual robust differences between the Indo-Hittite and Common IE lexica, they should emerge already from examination of this type of data. Finer details would definitely have to wait for an updated dataset, perhaps indeed including detailed probability assessments of each etymological comparison. (I do not believe establishing a clear cutoff between dubious and uncontroversial etymologies would be possible.)

      Of course, I suppose you might also be attempting to tell me that Anatolian etymology remains to this day an almost entirely open field where most proposals are merely speculation…?

      On the other hand, concerns about data quality don’t actually affect at all the theoretical issue at stake, i.e.: if we suppose we have some reasonably reliable distributional data, what conclusions could be drawn from it? This is not a trivial question, and even restricting ourselves to core vocabulary would not amount to reducing the question to classical lexicostatistics (an approach that requires various further assumptions, some of which I consider clearly unwarranted).

      • You are right: any source would contain at least some inaccuracies. But some sources are better than others. For example, LIV has its share of dubious etymological solutions, but still they are much fewer than dubious etymologies in Pokorny’s IEW. And since the scope of LIV is limited to primary verbs, we can be sure that its authors have undertaken a systematic search for comparisons and therefore (almost) no valid comparisons were overlooked.
        And if there was no such systematic search in the process of compiling an etymological dictionary, that dictionary will possibly have a skewed distribution of etymologies. It is almost certainly the case with UEW, where there are many more dubious etymologies for Hungarian words than (for example) for Mordvinic words.
        As for comparing only words with identical meanings (not necessarily from the “core lexicon”), this approach has one significant advantage: it allows us to deal with both positive evidence (Sanskrit and Old Irish words for ‘brother’ are related) and negative evidence (Hittite and Sanskrit words for ‘brother’ are not related). The fact that reflexes of some root are not attested in Hittite may be due to chance (the relevant word is not attested or was not correctly etymologized), but the fact that Hittite and Latin words for ‘four’ are not cognate is significant irrespective of how many other Hittite words are attested or correctly etymologized.

        “I suppose you might also be attempting to tell me that Anatolian etymology remains to this day an almost entirely open field where most proposals are merely speculation…?”
        At least some recent sources on Anatolian etymology, such as Kloekhorst’s EDHIL, are much more balanced in this respect than some earlier works.

  2. David Marjanović says:

    Oh, so you are Tropylium? :-)

    I keep being surprised by how little work has gone into IE phylogeny. Not only has Proto-Italo-Celtic not been reconstructed to the best of my knowledge, but the first ever Proto-West-Germanic reconstructions only came out in 2012 and 2013.

    It’s not just vocabulary, of course. From slide 22 onwards, this presentation shows “that ‘the most common type of PIE present’ is not PIE at all”, because it’s missing in Anatolian and very rare in Tocharian; it urges further research on ‘rediscovering’ “PIE (and other deceptively familiar reconstructed languages)”.

    It seems you’re trying to do this with a much larger dataset. That’s a good idea; I hope you get it to the point that you can publish it some day. What surprised me most about that paper is how incredibly much phylogenetic signal there is in their (Swadesh-200) dataset; the consistency indices of their trees are so high that if someone tried to get them published in biology from a dataset of the same size, I’d immediately think the authors cherry-picked their data! This means there’s almost no homoplasy in their trees – almost no borrowing, almost no accidental convergence. Obviously, using a larger dataset of less basic vocabulary would lower the CIs, but what I’m saying is that there’s a lot of room till things become indistinguishable from random.

    • j. says:

      The reconstruction of intermediate proto-languages, especially if separated only relatively narrowly from their parent node, is a very different affair from the reconstruction of bottom-level ones though. I suppose e.g. all the “contents” of Proto-Italo-Celtic could be largely put together from pre-existing lexical research and grammatical and phonological notes compiled elsewhere. The main question becomes instead deciding which specific level of development to call “the” proto-language — this can be quite sensitive to interpretations about the relative chronology and phonetic development paths of various innovations.

      The Rexová et al. paper is interesting, though I’m not exactly attempting to discover subgroups from the lexical distribution, as much as attempting to check if the already proposed ones (mainly the Common IE node, though I won’t protest if results turn up for Indo-Tocharian instead) could be fitted with some additional linguistic evidence. For example, would the disputed *a and *b turn out to be disproportionally rare in Anatolian?

      As two additional notes: one, I suspect their results with Armenian as an outlier are also in error, mainly due to its great lexical innovativeness, which might be partially fixable by using Tocharian as a second outgroup. Long-branch attraction, was it called? as they seem to note for Albanian already. Two, I continue to be annoyed that studies like these never report the actual evidence their analysis ends up using in support of this or that subgroup; in cases like these, I think it would be quite valuable for checking that there isn’t anything in there that we would know from the more detailed philological evidence to be shared retentions or areal innovations rather than shared innovations.

  3. David Marjanović says:

    “The main question becomes instead deciding which specific level of development to call ‘the’ proto-language — this can be quite sensitive to interpretations about the relative chronology and phonetic development paths of various innovations.”

    That’s exactly where making these hypotheses explicit by reconstructing intermediate protolanguages should help.

    “For example, would the disputed *a and *b turn out to be disproportionally rare in Anatolian?”

    Hard to say in principle. IIRC, only one Anatolian language hasn’t merged *a (including *e next to *h₂) and *o, and it’s one of the less well documented ones like Lycian or Lydian; and the plosive-row mergers would likely have destroyed much of the evidence for or against *b.

    “Long-branch attraction, was it called?”

    Yep: when the evidence for shared innovations is undone in a branch by a lot of later changes, so that random similarities take over and place that branch next to another long branch (usually close to the root), branch length being measured in this case as the number of changes.

    “I continue to be annoyed that studies like these never report the actual evidence their analysis ends up using in support of this or that subgroup”

    Such lists are very, very long, but can be reproduced by repeating the analysis, which is probably a matter of just a few hours. Can I get Dyen’s dataset somewhere…? Too bad the authors didn’t publish their data file; that wouldn’t fly nowadays.

    • j. says:

      “IIRC, only one Anatolian language hasn’t merged *a (including *e next to *h₂) and *o, and it’s one of the less well documented ones like Lycian or Lydian; and the plosive-row mergers would likely have destroyed much of the evidence for or against *b.”

      This would not be a problem. Provided that we can reconstruct the distinction between e.g. *b and *bʰ in the first place from Greek or Indic or etc. evidence, it’s possible to ask if the *b-words would be perhaps not found at all in Anatolian (i.e. perhaps they are post-Anatolian innovations?)

      “Such lists are very, very long, but can be reproduced by repeating the analysis, which is probably a matter of just a few hours.”

      I recall asking the authors of one study for examples of this data for some relatively suspicious nodes their analysis turned up, and being reported back that they do not know how to dig up such data from their analysis, or even if their analysis package even supports such an option at all. :|

      I also recall seeing one phylogenetic study on linguistic evolution, on the Turkic languages, that did report its defining innovations for its phylogenetic tree. The results of the analysis were not reassuring: e.g. a single innovation (*ž > j), which obviously must’ve been a widespread areal feature, was used for supporting several distinct subgroups.

      If also the standard algorithms of phylogenetic reconstruction generate their trees based on counting this kind of obviously nondiagnostic innovations, perhaps I should even consider nontransparent analyses of linguistic data unreliable by default, especially as applied to dialect continua. (But I’d like to see some further explicit evidence either way, first.)

  4. David Marjanović says:

    “This would not be a problem.”

    It would be a problem for telling if *a and *b were less rare than elsewhere.

    “I recall asking the authors of one study for examples of this data for some relatively suspicious nodes their analysis turned up, and being reported back that they do not know how to dig up such data from their analysis, or even if their analysis package even supports such an option at all. :-|”

    Well, that’s bad. The two programs that are most widely used today for parsimony analysis are very much not black boxes. I should write to Rexová and/or Zrzavý sometime.

    • j. says:

      No, there seems to be some kind of persistent confusion here. The question is not in telling if the synchronic phonemes /a/ and /b/ are somehow rare (this is not the case in any attested IE language, IIUC), but if roots where reconstructed *a and *b occur are rare. This requires absolutely no information about what the Anatolian reflexes are, only whether Anatolian cognates exist. I.e. the hypothesis is that *a and *b might be post-Indo-Hittite innovations, found only in loan vocabulary.

      For an analogy, consider the question whether English /ʒ/ has any common (West) Germanic origin. Simply saying “well, (varieties of) German and Dutch do not even distinguish that phoneme, therefore the question is difficult to investigate” does not cut it. If there are no good enough inherited etymologies connecting English /ʒ/ to German or Dutch equivalents, we’d have to consider it an innovation. Cases like Asia ~ Asien or garage ~ Garage exist of course, but even if we didn’t know their loan etymologies, they do not present a coherent enough picture to be considered common West Germanic.

      With Anatolian we probably don’t have a clear enough picture yet to weed the Asias-and-garages away from the data immediately, but if they exist, we should be able to see that the words in question are statistically few, concentrated in easily diffusible lexicon, and perhaps not as well represented elsewhere in IE as we’d expect of old inherited vocabulary.

      • David Marjanović says:

        “I.e. the hypothesis is that *a and *b might be post-Indo-Hittite innovations, found only in loan vocabulary.”

        My point is that this couldn’t be distinguished from the (highly unlikely…) opposite hypothesis that Anatolian and PIE were full of *a and *b, and most words that contained them just happened to die out in the non-Anatolian branch.

        “this is not the case in any attested IE language, IIUC”

        B is conspicuously rare in Sanskrit.

        • j. says:

          Ah, I see. Yes, those hypotheses are indeed not distinguishable. Though I think yours is unlikely enough to be just about a priori rejectable just per Occam’s Razor, as long as we do not know of any mechanism for favoring such a skewed development.

  5. David Marjanović says:

    (Oh, and, while I’m sure that Dutch Azië has [ʑ], I haven’t so far encountered this assimilation in anything that counts as German. The northern half or more of Germany uses [zj], and in the [z]-less south you’ll find [si] as an unreduced syllable.)
