Statistical etymology: A Votic example

Last Friday I picked up a dictionary of the Mahu dialect of Eastern Votic (Castrenianumin toimitteita 27, 1986), based on Lauri Kettunen’s collections from about a hundred years ago. [1]

This is not a particularly huge book: only about 150 pages of lexical data, set in a relatively large monotype font at that. It probably won’t be of much use if one wished to, e.g., translate Firefox into Votic. Its usability as a tourist dictionary might be limited as well (even if we ignore the sad fact that Votic is deeply moribund, with only a few dozen speakers left). But it seems like a good reference for a linguist wishing to make some contact with the language. Or: a handy unit of data for a linguist wishing to understand the lexical structure of languages.

The lexicons of natural languages are not random in their makeup. Phonemes differ in how frequently they occur in different positions of a word, and in how readily they combine with each other. And although one can certainly find linguists who will attempt to offer explanations in terms of elaborate synchronic phonological constraints and preferences, I find this a fundamentally flawed approach. [2] Much more often, any patterns evident in the lexicon are best understood as the fossilized results of historical processes: sound changes, loanword strata and evolving standards of sound-symbolic conventions. Studying a language’s lexicon even at a single point in time will likely turn up insights into its history.

For this type of analysis, this Votic dictionary actually seems like a rather good-sized sample. The lexicon of any major literary language would be both overwhelming in size (possibly thousands of pages) and swamped with recent cultural loanwords (if you happen to find a word shaped approximately like /banana/ or /platinum/ in a given language, this will not tell you much about its prehistory). Neither of these problems is apparent here, and it’s possible to focus on the big picture without getting stuck on data wrangling. At the other end, a still shorter list of, say, 100 words, whether artificially truncated or recorded in passing in 1820 from some now-extinct language, would not allow for many statistically significant conclusions at all.

A simple starter example: the Finnic languages originally did not contrast voicing in obstruents (as was already the case in Proto-Uralic). This situation still holds in Estonian, Northern Karelian, and dialects of Finnish. Votic, however, is among the sister languages that have fully embraced voicing, and contrasts voiced and voiceless versions of all obstruent consonants: /p t tš k f s š/ ≠ /b d dž g v z ž/. Suppose we were to hand a copy of this dictionary to a linguist who has never worked with Finnic before. Would they be able to uncover this older constraint?

The answer seems likely to be “yes”. Only minor etymological analysis is required, and the dictionary itself even provides it. The lexemes in the dictionary are glossed in both Russian and Finnish, the two major contact languages of Votic. Additionally, several words identifiable as recent Russian loans are explicitly marked as such. This allows an initial separation of the lexicon into two mostly disjoint layers: words of Finnic vs. Russian background. (Though of course Finnish has some Russian loanwords as well, and a small number of words whose origin is not immediately obvious can also be found.)

A look at words beginning with voiced obstruents other than /v/, as well as words beginning with /f/, shows that they, as a rule, belong in the Russian layer. This is a small set to begin with, and after this cleanup, no more than seven counterexamples remain:

  • balalaittaag ‘to gossip’
  • bëëg ‘isn’t’
  • borissag ‘to bubble’
  • bulissag ‘to bubble’
  • börö ‘ironing board’
  • däädi ‘some relative’
  • filissaag ‘to whistle’

So we have four onomatopoetic verbs, one unstressed particle, one nursery word, and one fully legit content word. This is not sufficient evidence to postulate the voicing contrast to be original in the initial position, not when evidently inherited words beginning with /p t tš k s v/ number multiple hundreds altogether. [3]
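
For anyone who would rather let a script do the sifting, the filtering step described above could be sketched roughly as follows. This is only a minimal illustration, not how I actually went through the book: it assumes the dictionary has already been typed up as (headword, marked-as-Russian-loan) pairs, which is a format I am making up here. The seven real headwords are the ones listed above; the `…xxx` entries are invented placeholders standing in for Russian loans and inherited words, not actual Votic data.

```python
import unicodedata
from collections import Counter

# Word-initial segments treated as suspect in the post:
# voiced obstruents other than /v/, plus /f/.
# The digraph "dž" comes first so that it is not misread as plain "d".
SUSPECT_INITIALS = tuple(
    unicodedata.normalize("NFC", seg)
    for seg in ("dž", "b", "d", "g", "z", "ž", "f")
)


def initial_segment(headword):
    """Return the suspect initial segment of a headword, or None if it has none."""
    word = unicodedata.normalize("NFC", headword.lower())
    for seg in SUSPECT_INITIALS:
        if word.startswith(seg):
            return seg
    return None


def residual_counterexamples(entries):
    """Keep headwords with a suspect initial that are NOT marked as Russian loans."""
    return [
        headword
        for headword, is_russian_loan in entries
        if not is_russian_loan and initial_segment(headword) is not None
    ]


# Hypothetical input format: (headword, marked as Russian loan?).
SAMPLE = [
    # The seven counterexamples listed in the post.
    ("balalaittaag", False), ("bëëg", False), ("borissag", False),
    ("bulissag", False), ("börö", False), ("däädi", False),
    ("filissaag", False),
    # Invented placeholders for entries marked as Russian loans.
    ("bxxx", True), ("gxxx", True), ("fxxx", True),
    # Invented placeholders for inherited words with unsuspicious initials.
    ("kxxx", False), ("pxxx", False), ("vxxx", False),
]

residue = residual_counterexamples(SAMPLE)
by_initial = Counter(initial_segment(hw) for hw in residue)
print(len(residue), dict(by_initial))  # 7 {'b': 5, 'd': 1, 'f': 1}
```

Sorting the residue further into onomatopoeia, particles, nursery words and so on still has to be done by eye; the script only narrows down where to look.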

A more detailed examination would find that medial voiced consonants other than /v/ can similarly be shown to be secondary: they occur as the consonant gradation alternants of the voiceless ones. Exceptions, as a rule, again occur only in Russian loans and probably some onomatopoeia. The full details would be more difficult to dig up, though, so I am leaving this as an exercise for the interested reader. ;)

[1] In case anyone else is interested, some overflow stock of these from dunno where is still up for grabs at the University of Helsinki’s Dept. of Finno-Ugric Studies (Metsätalo/Unioninkatu 40, 4th floor).
[2] This may not be an entirely fair comparison, but… I have in mind the image of a “generative geologist” attempting to locate physical constraints present in gneiss or sediment that force its minerals to hold a macroscopically banded rather than homogeneous structure.
[3] I will not dwell on /š/, also mainly a loanword phoneme.

3 comments on “Statistical etymology: A Votic example”
  1. David Marjanović says:

    What happened to your layout? :-(

    Could börö be a loan of some Norse cognate of board?

    And although one can certainly find linguists who will attempt to offer explanations in terms of elaborate synchronic phonological constraints and preferences, I find this a fundamentally flawed approach.

    Some such constraints and preferences no doubt exist, and influence which sound shifts and which analogical changes are more likely to happen. They’re just never the whole story; “everything is the way it is because it got that way” (D’Arcy Wentworth Thompson, development biologist, in his 1917 book On Growth and Form).

    The example that comes to mind is the pretty common shift from word-final /m/ to /n/. It happened, for example, between PIE and Proto-Germanic, then a round of apocope left /m/ stranded at the ends of words again, and then the same shift happened again during Old High German times, and then yet another round of apocope has left modern German littered with final /m/! Of course I haven’t tried, but I strongly suspect that all of this would be reconstructible even if OHG were unattested.

    • j. says:

      What happened to your layout? :-(

      The old theme had a bit too narrow a text area, I think, and it had some issues with the display of IPA. I am not fully satisfied with the current one either (e.g. the quotes appear kind of huge) but I’m done with tweaking things for now.

      Could börö be a loan of some Norse cognate of board?

      Doesn’t seem entirely implausible, but I wonder whether there are any precedents for isolated Germanic loans in Votic. Its Slavic cognate *bьrdo ‘comb’ might be an equally good place to go looking in.

  2. Precisely this line of reasoning was used by Dmitry Vladimirovich Bubrikh to argue that the ancestor of modern Erzya did not have voiced obstruents. Bubrikh wrote at a time when the official doctrine of Soviet linguistics was Marrism, which denied the existence of proto-languages of any kind.
    Bubrikh was not allowed to work on Proto-Finno-Ugric reconstruction, so he turned to internal reconstruction of various FU languages, including Erzya. His book “Historical grammar of the Erzya language” (“Историческая грамматика эрзянского языка”) was published posthumously in 1953.
    In his book, Bubrikh shows step by step what can be inferred about the structure of pre-Erzya (this is my term, not Bubrikh’s) on purely internal grounds. He meticulously demonstrates that
    1) pre-Erzya word-initial obstruents were always voiceless,
    2) in pre-Erzya clusters of several obstruents, obstruents were always voiceless,
    3) obstruents between two consonants (even between two sonorants) are voiceless even in modern Erzya,
    4) word-medially, Erzya voiced obstruents go back to earlier voiceless obstruents,
    5) word-medially, Erzya voiceless obstruents (in the positions where we expect voicing) go back to earlier geminate voiceless obstruents (in Erzya morphonology they behave like clusters),
    6) word-finally, Erzya voiced obstruents go back to earlier voiceless obstruents.
    Thus, pre-Erzya did not have voiced obstruents.

