On the epistemology of sound change, part 1

Continuing from the last post, and toning the meta-ness of the discussion down just a little…
What does it, at the level of everyday research, mean for me to request “justification on the basis of more elementary phenomena” for the concepts of historical linguistics? Say, from the viewpoint of sound change?

The foundations of the concept

The concept of sound change is already implicit in the concept of cognate words. If we assert that a word such as Hungarian ősz ‘autumn’ is cognate to Finnish syksy ‘id.’, then this means that at one time, a common proto-form of the words existed. (I will not go into unpacking what it means for a “language” to have a “word” that is expressible by a string of letters, although these are good questions to ask as well.) The self-contained apparatus of historical linguistics can also produce a graphical representation of this; according to my chosen system of Proto-Uralic reconstruction, it will be *sükśə or perhaps *sükəś. We may also propose a phonetic value for this. Indeed, most linguistic transcription systems already do this implicitly. Despite occasional use of cover symbols for difficult-to-reconstruct segments such as Proto-Uralic *d₂, or in words whose precise phonemic content cannot be resolved from the available evidence, I cannot say I have ever seen a purely abstract presentation of a proto-language.

Usually, the suggested pronunciation of the word will end up different from the real attested words on whose basis it was posited, in at least some respects. This thus requires that changes in pronunciation, i.e. sound changes, have occurred at some point in the evolution of Hungarian, Finnish, and any other Uralic languages. In this particular case, we have the loss of palatalization, *ś > *s, in both Finnish and Hungarian; the loss of *k, plain *s, and the second-syllable vowel entirely in Hungarian; the lowering of *ü to *ö in Hungarian; and its acquisition of length. It would be possible to shuffle some of these changes around (e.g. perhaps it is not Hungarian that has lost a /k/, but Finnish that has gained one? perhaps originally a third type of sound yet occurred here?), but the fact that ősz and syksy are not identical in their pronunciation will remain.

In clearer words yet: that sound change somehow, for some reason, happens is clear already from the idea that etymologies exist, i.e. that non-identical words can have a common origin. Interestingly, note also that the existence of relationships between entire languages is not a required assumption at this point.

By the way, note that I do not claim this to be the actual history of how the concept of sound change was developed (that story is much more complex yet). This is only an observation on the internal logical structure of the modern-day theory of historical linguistics.

There is thus “downwards inference” involved here. Instead of tinkering with empirical research on articulation & such and discovering that a certain series of events can add up to the large-scale phenomenon of sound change, we have looked at higher-level data yet and found patterns that can be effectively explained by assuming the existence of sound change — despite not yet knowing the first thing about how it works. As a scientific theory, this is adequate insofar as it can still provide predictions, but naturally it leaves us asking: what exactly are these sound changes? Can we actually see one happening somewhere? Could a detailed understanding of them enhance our understanding of etymology as well?

Motivating the instances

There do exist disciplines like phonetics and sociolinguistics that are directly tackling the questions of the evolution of language on the scale of years, weeks, milliseconds rather than generations. However, the theory of sound change can be further sharpened already by more careful investigation of etymology.

There is of course also the tiny snag that detailed sociolinguistic data or phonetic records do not exist for the pre-modern histories of languages (let alone the entirety of prehistory). So we are mostly unable to directly observe sound changes in their full historical context, and indirect inference from etymological data remains our almost sole option for finding out about them. This means that care is required to not drift off to pure speculation.

What, then, is sufficient evidence for assuming a specific sound change to have occurred?

The “naïve method”, I could call it, is to simply indiscriminately collect sound correspondences that “seem to” exist between some given language varieties, claim as cognate any pairs of words that can be linked by some application of these, and then present some sound changes that can account for the correspondences. This has been, and continues to be, used as the usual first step in investigating the sound correspondences within a group of “obviously related” language varieties, such as dialects of a single Language. (We still do not need to take a stance on whether “non-obvious relationships” can exist between languages.) Or, if we’re investigating loanword etymology, we might look at any pair of languages we believe to have been once somehow involved with each other.

Back to the previous Finnish vs. Hungarian example, e.g. two original s-ish consonants can be assumed right off the bat, that we could preliminarily call *s₁: defined as becoming /s/ in Finnish, vs. zero in Hungarian; and *s₂: defined as becoming /s/ in Finnish and Hungarian both.

I will skip for now the wider problem of how to determine what segment exactly corresponds to what. For Uralic languages this is known to be, 99% of the time, a simple task: an initial stressed syllable corresponds to an initial stressed syllable, consonants correspond to consonants, vowels correspond to vowels.

The naïve method is much too powerful though, and left to run on its own, will inevitably lead to an unfalsifiable system where anything can be related to anything else. This is because under it, word-level and sound-level relatedness are transitive. If we claim that sz in Hungarian ősz corresponds to the 2nd s in Finnish syksy, then it follows that the words are related in general, and that also Hungarian ő corresponds to Finnish y. If this is taken as an excuse to now relate any word that has y in Finnish to any word that has ő in Hungarian, and so forth, one can eventually rack up a correspondence library that allows relating everything to everything else.
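To make the runaway behaviour concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two-segment “words” are invented strings rather than real data, alignment is simply position by position, and the function name naive_closure is hypothetical. The rule it implements is the naïve one: accept any comparison that shares a single already-known correspondence, then add all of its other correspondences to the library.

```python
from itertools import product

def naive_closure(candidates, seed):
    """Accept any word pair sharing one known correspondence, then add all
    of its other correspondences to the library; repeat until nothing changes."""
    library = set(seed)
    accepted = set()
    changed = True
    while changed:
        changed = False
        for a, b in candidates:
            if (a, b) in accepted:
                continue
            pairs = set(zip(a, b))  # position-by-position alignment
            if pairs & library:     # one shared correspondence is "enough"
                accepted.add((a, b))
                library |= pairs
                changed = True
    return library, accepted

# Invented toy vocabularies for two hypothetical language varieties
words_a = ["pa", "ta", "ti", "ki", "ku"]
words_b = ["bo", "do", "de", "ge", "ga"]
candidates = list(product(words_a, words_b))

library, accepted = naive_closure(candidates, seed={("p", "b")})
print(len(accepted), "of", len(candidates), "comparisons accepted")
print(len(library), "correspondences in the library")
```

Starting from the single seed correspondence p ~ b, the toy run ends up accepting every one of the 25 possible comparisons, with each consonant of the first variety “corresponding” to each consonant of the second. Nothing in the procedure is able to say no.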

There are, in principle, two ways of avoiding the problem. The first is a purely statistical approach: if two words don’t share at least some proportion X of known sound correspondences, we do not accept the comparison and do not accept any new sound correspondences that it would imply. This algorithm requires a “seed” of correspondences though — if you sic it on languages of which you know nothing, it will detect no related words, what with no correspondences being accepted yet. A “seed” must be instead generated by some other method. Likely ideas for this might be:

  • Building up a set of word comparisons that is closed with respect to sound correspondences, and where every correspondence occurs at least n times. An n = 2 example might be Finnish kala, pala, kesä, pesä ~ Northern Sami guolli, buolli, geassi, beassi (‘fish’, ‘bit’, ‘summer’, ‘nest’).
  • A set of correspondences that occur highly often and/or are between identical segments.

These seed methods, I believe, probably won’t manage to uncover everything that can be uncovered all by themselves, but let’s leave a closer analysis of their pros and cons for some other time.
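As a rough illustration of the statistical approach and of the first seed idea, consider the following Python sketch. The segmentation of the Finnish and Northern Sami words into aligned units (e.g. a ~ uo, l ~ ll) is my own hand-made assumption, the “noise” comparison is invented, and the function names closed_seed and accept as well as the default threshold are hypothetical choices, not an established algorithm.

```python
from collections import Counter

def align(a, b):
    """Pair up two pre-segmented words position by position."""
    return tuple(zip(a, b))

def closed_seed(comparisons, n=2):
    """Prune comparisons until every remaining correspondence is found
    in at least n of the remaining comparisons."""
    kept = list(comparisons)
    while True:
        counts = Counter(pair for comp in kept for pair in set(comp))
        pruned = [c for c in kept if all(counts[p] >= n for p in c)]
        if len(pruned) == len(kept):
            return kept, set(counts)
        kept = pruned

def accept(comp, library, x=0.5):
    """Accept a comparison if at least a proportion x of its
    correspondences are already in the library."""
    known = sum(pair in library for pair in comp)
    return known / len(comp) >= x

data = [
    align(["k", "a", "l", "a"], ["g", "uo", "ll", "i"]),  # kala ~ guolli
    align(["p", "a", "l", "a"], ["b", "uo", "ll", "i"]),  # pala ~ buolli
    align(["k", "e", "s", "ä"], ["g", "ea", "ss", "i"]),  # kesä ~ geassi
    align(["p", "e", "s", "ä"], ["b", "ea", "ss", "i"]),  # pesä ~ beassi
    align(["k", "X"], ["g", "Y"]),                        # invented noise
]

# With an empty library, the threshold test accepts nothing at all.
print(any(accept(c, set()) for c in data))

seed_comparisons, seed = closed_seed(data, n=2)
print(len(seed_comparisons), "comparisons,", len(seed), "correspondences")
```

The noise comparison is pruned because its X ~ Y correspondence occurs only once, after which the four real comparisons form a closed set in which all eight correspondences occur exactly twice. This seed can then be handed to the threshold test as its initial library.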

A more interesting point is that these methods, phrased solely in terms of sound correspondences, are mainly focused on binary comparison. Correspondence-counting of any kind however runs into some rather nasty mathematical problems when a larger number of language varieties is involved. Consider for example: should a correspondence set t ~ t ~ t ~ t ~ θ ~ t be counted as a completely different entity from a correspondence set t ~ t ~ t ~ t ~ t ~ t?

  • If yes, we hit what is called the curse of dimensionality. Say we have two languages with 20 consonants each: there are then 20² = 400 possible correspondences between these, and we can well expect a decent bundle of etymological data to not only highlight which of these correspondences are highly recurring, but also which are noticeably rare or absent. But if we rather have as few as six languages, the count of possible correspondence sets becomes 20⁶ = 64,000,000. The lexical stock of even the best-documented languages only reaches a fraction of this, and a typical etymological data set is unlikely to exceed a couple of thousand words. Given a space of millions of possible correspondences, most data points will perhaps cluster at some stable points, or in their vicinity. Any correspondence set that turns up outside of these islands (say, an apparent correspondence h ~ t ~ t ~ d ~ s ~ z) will be difficult to assess. Also, by far most possible correspondence sets will be entirely absent, and we’ll have no chance of telling if the absence of one particular correspondence carries any statistical significance.
  • And if no — if we treat comparison sets as built from pairwise correspondences — then the transitivity problem pops up again. If we find a correspondence t ~ t ~ d ~ d ~ d ~ d, and we already know the existence of a correspondence t ~ t between the first two languages, can we really count it as evidence for the unity of this larger correspondence set in general?
  • And what of incomplete correspondence sets? Suppose we have p ~ b between languages 1 and 2; b ~ v between languages 2 and 3; p ~ v between languages 1 and 3. Can we really take this as sufficient evidence to unite them to a single correspondence set p ~ b ~ v? What if a correspondence p ~ p between 1 and 3 exists as well?
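The three cases above can be put into numbers with a short Python sketch. The function names (possible_sets, pairwise, compatible_sets) are hypothetical, languages are simply numbered from 0, and the last function assumes that every pair of languages has at least one observed correspondence. It only checks which full sets are compatible with the pairwise observations; it says nothing about whether that compatibility counts as sufficient evidence.

```python
from itertools import combinations, product

def possible_sets(inventory_size, n_languages):
    """Number of distinct n-way correspondence sets."""
    return inventory_size ** n_languages

def pairwise(corr_set):
    """Project an n-way correspondence set onto its pairwise correspondences."""
    idx = range(len(corr_set))
    return {(i, j): (corr_set[i], corr_set[j]) for i, j in combinations(idx, 2)}

def compatible_sets(observed, n_languages):
    """List the full sets whose every pairwise projection has been observed."""
    segments = [set() for _ in range(n_languages)]
    for (i, j), pairs in observed.items():
        for a, b in pairs:
            segments[i].add(a)
            segments[j].add(b)
    lang_pairs = list(combinations(range(n_languages), 2))
    return [c for c in product(*map(sorted, segments))
            if all((c[i], c[j]) in observed.get((i, j), set())
                   for i, j in lang_pairs)]

print(possible_sets(20, 2), possible_sets(20, 6))

# Treated as wholes, these two sets are simply different entities ...
a = ("t", "t", "t", "t", "θ", "t")
b = ("t", "t", "t", "t", "t", "t")
# ... but pairwise they overlap heavily.
pa, pb = pairwise(a), pairwise(b)
shared = sum(pa[k] == pb[k] for k in pa)
print(shared, "of", len(pa), "pairwise correspondences shared")

# The incomplete case: p ~ b, b ~ v, p ~ v, plus an additional p ~ p
observed = {
    (0, 1): {("p", "b")},
    (1, 2): {("b", "v")},
    (0, 2): {("p", "v"), ("p", "p")},
}
print(compatible_sets(observed, 3))
```

The two six-way sets agree on 10 of their 15 pairwise correspondences, namely all those that do not involve the fifth language. In the incomplete case only p ~ b ~ v is compatible with all the pairwise observations, while the additional p ~ p between the first and third language is left unaccounted for.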

Instead of getting stuck on fine-tuning these problems, it’s however possible to change gears. There is a second, fundamentally different method possible as well: the chronological approach, whose nature I will be elaborating in the next post of this series.

