Thoughts on lexical data and its subgrouping

A general theory of distributional analysis of lexical data that also incorporates a degree of historical analysis remains a thing I would like to exist. I’ve seen plenty of work of some kind done on this already, yes, but most of it strikes me as either methodologically primitive groping in the dark, or historically agnostic surface-descriptivism.

The type of data I’m thinking of generally comes in two forms: comparative data, drawn from distinct branches of a language family; and dialectological data, drawn from a set of mutually intelligible varieties within a dialect continuum. The two look fairly similar, being best displayed as a matrix with a list of tokens and a list of varieties as its axes, and (at its simplest) 0 or 1 in each cell, depending on whether the token in question is found in the variety in question.
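As a minimal sketch of that matrix layout: the word and variety labels below are placeholders (not real data), and `presence_matrix` is a helper name I am making up for illustration.

```python
# Toy attestation data: word -> set of varieties it is recorded in.
# All labels are hypothetical placeholders.
attestations = {
    "word_1": {"V1", "V2"},
    "word_2": {"V1", "V2", "V3"},
    "word_3": {"V3"},
}

def presence_matrix(attestations):
    """Return (words, varieties, rows), where rows[i][j] is 1 if
    words[i] is attested in varieties[j], else 0."""
    words = sorted(attestations)
    varieties = sorted(set().union(*attestations.values()))
    rows = [[1 if v in attestations[w] else 0 for v in varieties]
            for w in words]
    return words, varieties, rows

words, varieties, rows = presence_matrix(attestations)
for word, row in zip(words, rows):
    print(word, row)
```

Each row is a word's distribution vector and each column is a variety's recorded lexicon, so the same structure serves both the comparative and the dialectological case.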

Some differences of note between the two forms of data, though:

  • Dialect data often ends up forming substantially larger datasets. I’m currently poking around a little in the Estonian dialect data from Väike murdesõnastik. Though called the ‘Small dialect dictionary’, it runs to a respectable 73000+ entries, an order of magnitude more than you’ll see in even the most positivistic comparative dictionaries. A count of 124 varieties sampled also overshadows all but the largest language families, if considered at the level of distinct languages. An actual large dialect dictionary will do better yet: the Finnish dialect dictionary covers some 530 varieties, has been in the works since the 60s, and is projected to include ~350000 entries when finished.
  • Dialect data often includes an abundance of single-variety entries. In comparative analysis we almost always require an entry to be found in at least two of the varieties under comparison, if not more (for reasons of subgrouping). Theoretically, morphophonological variation within a lexeme can by itself allow reconstructing it to an early enough period, but this is rare.
  • Comparative data may require probabilistic encoding. The ever-present uncertainty in etymological comparisons, sometimes also hapaxes from unreliable sources, will often make it impossible to state with anything resembling certainty whether a given word-group has a reflex in a language or not, or whether the group even exists in the first place. Possibly this last point even means that the “rows” would be better initially modelled as networks than vectors, with probabilities primarily applying to pairwise comparisons.
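The probabilistic encoding in the last point could be sketched as follows. This is only one possible reading: cells hold a probability that a reflex exists, and I assume (naively) that cells are independent of each other, so the expected number of shared items is a sum of products. The labels, the probabilities and the function name `expected_shared` are all invented for illustration.

```python
# Toy probabilistic matrix: word-group -> {variety: P(reflex exists)}.
# Values are made up; independence between cells is an assumption.
prob = {
    "set_1": {"A": 1.0, "B": 0.5, "C": 0.0},
    "set_2": {"A": 0.8, "B": 0.5, "C": 1.0},
}

def expected_shared(prob, x, y):
    """Expected number of word-groups with a reflex in both x and y,
    treating the two cells of each row as independent events."""
    return sum(cells.get(x, 0.0) * cells.get(y, 0.0)
               for cells in prob.values())

print(expected_shared(prob, "A", "B"))
print(expected_shared(prob, "A", "C"))
```

A network model with probabilities on pairwise comparisons would need a different structure (edges rather than cells); this sketch covers only the simpler per-cell case.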

More data means better resolution — which suggests that method development ought to be built based on dialect data, and only later extended to deeper comparative work, once we really know what we’re doing.


One initial research question I would like to tackle is subgrouping. By which I do not mean the establishment of a complete phylogenetic tree: this is probably a counterproductive question for dialect continua in general, where isoglosses will often be probabilistic rather than hard-and-fast anyway (“word X is known by 85% of speakers at locale A, but only by 22% of speakers at locale B”) and unique ancestry cannot be defined even in theory (people move; idiolects have complex many-rooted origins). But suppose that our data isn’t a single continuum, and rather represents two or more clearly separate clusters; or, more realistically, two or more originally separate clusters bleeding together due to secondary contacts at their edges. (In the case of Estonian, at least the South Estonian / North Estonian division will likely be one of these.) Will we at least be able to detect this?

For starters, let’s simply consider the question of synchronic subgrouping. I can think of multiple ways this could be done. Here are some fairly simple means of circumscribing subsets, in what will likely be descending order of strength:

  1. Discreteness: Cluster A and cluster B share no vocabulary.
  2. Innovativeness: The amount of vocabulary found only within cluster A is substantially higher than the amount of vocabulary shared with other varieties.
  3. Compactness: All varieties within cluster A share proportionally more vocabulary with each other than with any varieties outside of it.
  4. Local compactness: All varieties within cluster A have a “neighborhood” of varieties also within the same cluster, with which they share proportionally more vocabulary than with any varieties outside of the cluster. (But opposite ends of the cluster may still share more vocabulary with external varieties than with each other.)
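To make definitions 1, 3 and 4 concrete, here is a rough sketch. It rests on assumptions of mine: “sharing proportionally more vocabulary” is read as a higher Jaccard index, and a “neighborhood” is read as at least one other cluster member that beats every outsider. The toy lexicons use integers as stand-ins for words, and all function names are hypothetical.

```python
def jaccard(a, b):
    """Shared vocabulary as a proportion of the combined vocabulary."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def _union(lex, names):
    return set().union(*(lex[v] for v in names))

def is_discrete(lex, cluster):
    """Definition 1: the cluster shares no vocabulary with the rest."""
    outside = [v for v in lex if v not in cluster]
    return not (_union(lex, cluster) & _union(lex, outside))

def _check(lex, cluster, pick):
    outside = [v for v in lex if v not in cluster]
    for v in cluster:
        internal = [jaccard(lex[v], lex[u]) for u in cluster if u != v]
        external = [jaccard(lex[v], lex[o]) for o in outside]
        if not internal:
            return False  # a one-member "cluster" is not tested here
        if external and pick(internal) <= max(external):
            return False
    return True

def is_compact(lex, cluster):
    """Definition 3: every member's weakest internal link beats its
    strongest external link."""
    return _check(lex, cluster, min)

def is_locally_compact(lex, cluster):
    """Definition 4: every member's strongest internal link beats its
    strongest external link."""
    return _check(lex, cluster, max)

# A chain A1-A2-A3 whose ends share nothing, plus an outsider B.
lex = {
    "A1": {1, 2, 3, 4},
    "A2": {3, 4, 5, 6},
    "A3": {5, 6, 7, 8},
    "B": {1, 9, 10, 11},
}
chain = {"A1", "A2", "A3"}
print(is_discrete(lex, chain), is_compact(lex, chain),
      is_locally_compact(lex, chain))
```

In this toy chain, A1 shares one word with the outsider B but none with A3, so the chain fails compactness while still passing local compactness, which is the situation the parenthesis in definition 4 describes.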

Further notes:

  • All schemes allow nested subgroups.
  • From #2 on, the existence of a cluster does not imply that the remainder of the data can be sorted into one or more similar clusters. We could define “mutual compactness” etc. for cases where this happens.
  • “Innovativeness” seems for now somewhat ill-defined. Without establishing some kind of cluster-internal requirements for membership, we could imagine affixing “imposter” varieties to a highly innovative cluster to bring its innovativeness down, but not far down enough to make it statistically insignificant. But perhaps subgroup inclusion filtering will fix this? If there are some partially overlapping innovative subgroups, say A, A′ = A ∪ X, and A″ = A ∪ Y, but there are no innovative subgroups that intersect A (= are neither a proper subset nor a proper superset of A), we can isolate A as the “real” innovative subgroup.
    — Does this require demanding, however, that A will be also more innovative than any of its possible extensions?
  • “Discreteness” has a similar issue, but it is easily dispensed with in similar ways, ending with a maximal set of discrete clusters (“intermediate discrete subgroups” do not actually exist).
  • A pair of varieties will be either both compact and locally compact, or neither.
  • A compact subgroup can fail to exist only if all varieties share an equally high percentage of vocab with at least two other varieties, forming a chain or network covering all the varieties.
  • The criteria allow distinguishing several different senses of “dialect continuum”, depending on which types of subgroups a set of dialects does not allow division into.
  • All of these definitions can end up identifying a group by shared archaisms rather than by shared innovations, as heavy lexical replacement in one subgroup can leave old vocabulary restricted to the varieties outside this group. I think this is a feature, not a bug. Synchronic perceptions of linguistic groups are often also based on such characters, after all. A good example might be Ludian, whose most prominent distinguishing characteristic from Karelian varieties is probably the retention of Proto-Finnic *b *d *g. (Consider also typological categories in biology, such as “fish” with tetrapods excluded or “monkey” with humans excluded.)
  • None of these definitions makes any reference to geography. This is also intentional. Geography should only come into play once we start reconstructing ancestries and contact influences.
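The “imposter” worry about innovativeness (definition 2) can be illustrated with a small count. This is a sketch under my own assumptions: I count types attested anywhere in the cluster, the function name `innovativeness_counts` is made up, and the toy lexicons are invented placeholders.

```python
def innovativeness_counts(lex, cluster):
    """Return (exclusive, shared): the number of words attested only
    inside the cluster, and the number attested both inside and outside."""
    inside = set().union(*(lex[v] for v in cluster))
    outside = set().union(*(lex[v] for v in lex if v not in cluster))
    return len(inside - outside), len(inside & outside)

# Toy data: A1 and A2 share four innovations (i1..i4);
# X is a potential "imposter", O is an ordinary outsider.
lex = {
    "A1": {"i1", "i2", "i3", "i4", "s1"},
    "A2": {"i1", "i2", "i3", "i4", "s2"},
    "X": {"s1", "s2", "s3", "x1"},
    "O": {"s1", "s2", "s3", "o1"},
}

core = innovativeness_counts(lex, {"A1", "A2"})
padded = innovativeness_counts(lex, {"A1", "A2", "X"})
print(core, core[0] / core[1])
print(padded, padded[0] / padded[1])
```

In this toy case the padded cluster still has more exclusive than shared vocabulary, only by a smaller margin, so a bare threshold on the ratio would accept both A and A ∪ X. Some filtering step of the kind discussed above would be needed to pick out A.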

I would predict that e.g. South Estonian or Insular Estonian are locally compact subgroups. Probably no large compact or large innovative subgroups exist anymore within the Finnic languages on the other hand, with the possible exception of Livonian.


The above approaches however all share one trait: they assume that subgroups need to be subsets, i.e. that for any variety X we can say that it either belongs or does not belong in the subgroup. Even this is perhaps not quite necessary, though.

Intensional approaches to subgrouping could also be considered, i.e. groups defined in terms of words whose distribution resembles each other, rather than varieties whose lexicon resembles each other.

(This is of course identical to an extensional analysis of the subgroups of the lexicon. Just transpose your words-by-dialect matrix to get instead a dialects-by-words matrix.)

Any intensional subgroups will likely be fuzzy: a single variety could only be characterized in terms of what percentage of various lexical subgroups it contains, not in terms of set membership. When considering Ausbau languages like Estonian or Finnish that are known to be ill-defined in extensional terms, this approach could perhaps be more productive. It’s possible that we could still identify a cluster of traits that mostly correlates with the everyday understanding of such terms, and then say that a given Finnic variety has e.g. “92% Estonian-ness, but 24% Finnish-ness”.
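A fuzzy score of this kind could be computed roughly as follows, assuming we already have a cluster of words that we are willing to call “Estonian-type” or “Finnish-type”. How to find such clusters is the hard part and is not shown; the word labels and the function name `membership` are placeholders.

```python
def membership(lexicon, word_cluster):
    """Share of a word cluster that is attested in one variety's lexicon."""
    if not word_cluster:
        return 0.0
    return len(lexicon & word_cluster) / len(word_cluster)

# Hypothetical word clusters and one hypothetical variety.
estonian_type = {"e1", "e2", "e3", "e4"}
finnish_type = {"f1", "f2", "f3", "f4"}
variety = {"e1", "e2", "e3", "f1", "other"}

print(membership(variety, estonian_type))  # 3 of 4 cluster words present
print(membership(variety, finnish_type))   # 1 of 4 cluster words present
```

Note that the scores are independent of each other and need not sum to 100%, which matches the “92% / 24%” style of description.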

Conversely: note that due to symmetry, extensional subgroups are also likely to have fuzzy trait distributions. Any pan-Finnic word will be “100% South Estonian, 100% North Estonian, 100% Livonian” etc. One that is less well-preserved could be “64% South Estonian, 77% North Estonian, 67% Livonian”.

In extensional subgrouping, at this point the usual approach is to locate not just all typical traits, but the uniquely typical (“defining”) traits of a subgroup. Of course, the same procedure will work just fine in intensional subgrouping: for given definitions of “South Estonianness” and “North Estonianness”, we could probably exclude their common traits, and then identify varieties that have zero “strict SE-ness” but non-zero (and, hopefully, high) “strict NE-ness”, or vice versa.
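Excluding the common traits amounts to two set differences. A sketch, with invented trait labels and a made-up function name `strict_scores`:

```python
def strict_scores(lexicon, group_a, group_b):
    """Score a variety against the traits unique to each of two groups,
    after the traits common to both groups have been removed."""
    strict_a = group_a - group_b
    strict_b = group_b - group_a

    def score(traits):
        return len(lexicon & traits) / len(traits) if traits else 0.0

    return score(strict_a), score(strict_b)

# Hypothetical trait sets: c* are common to both, s* and n* are unique.
se_traits = {"c1", "c2", "s1", "s2"}
ne_traits = {"c1", "c2", "n1", "n2", "n3", "n4"}
variety = {"c1", "c2", "n1", "n2", "n3"}

strict_se, strict_ne = strict_scores(variety, se_traits, ne_traits)
print(strict_se, strict_ne)
```

The toy variety has all the common traits, which says nothing either way, but zero strict SE-ness and a high strict NE-ness.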


Looking thru actual datasets, I however wonder if this is still making too many underlying assumptions to be immediately explorable. One major concern I have is data homogeneity: in comparative data this is usually mostly assured, but in dialect research, often enough there are major fluctuations in the size of the collections known from each variety. The Vms data for Estonian for example spans nearly four orders of magnitude, from a completely useless three entries from Vormsi, to some 300±50 entries in the next smallest collections, to around 12500 entries each from the parishes of Kuusalu, Kihelkonna and Kodavere. Problems like “we know 700 words of potential Proto-Uralic inheritance from Samic, but only 200 from Samoyedic” are nothing compared to this.

The simplest approach would probably be to mercilessly prune the data down, by excluding overly small or overly large varieties from analysis. (Not necessarily even as statistical outliers, just as an experiment to see how this changes the analysis.) Another theoretical option — as known from lexicostatistics — is to limit the analysis to a particular set of meanings; but for dialectology, anything based on Swadesh lists is not exactly feasible, as we don’t want to throw out 90%+ of the raw data in the process. Some substantially larger list of “basic” meanings to work on would be required. Yet such a list would likely also have to be language-specific.
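The pruning experiment is simple enough to sketch. The variety names, entry counts and thresholds below are arbitrary placeholders, to be varied and compared rather than taken as recommendations.

```python
def prune_by_size(entry_counts, min_size, max_size):
    """Keep only varieties whose collection size lies within
    [min_size, max_size]; returns a new dict."""
    return {variety: n for variety, n in entry_counts.items()
            if min_size <= n <= max_size}

# Hypothetical collection sizes per variety.
entry_counts = {"V_tiny": 3, "V_small": 300, "V_mid": 4000, "V_large": 12500}

kept = prune_by_size(entry_counts, min_size=200, max_size=10000)
print(sorted(kept))
```

Re-running any subgrouping test on several differently pruned versions of the data would show how sensitive the result is to the outlying collections.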

We could also attempt explicitly modelling the heterogeneity of the collections. The trivial but naive approach to this would be to treat each dialect collection as a random sample from the total lexicon of the variety. This is probably not the way to go. Usually smaller collections are more likely to focus on “core” vocabulary, while larger ones are more likely to also include derivatives, compounds and the like. This will clearly be a problem. Suppose that varieties A, B and C form, in some sense, a single dialect group; but while the datasets from dialects A and B include 8000 items each, with numerous derivatives, the dataset from dialect C only includes some 600 items and covers almost solely basic underived words. We would expect that the affinity of A and B would be obvious through e.g. a large number of shared derivatives; but the affinity of C would have to be determined through much subtler signals.

At a first approximation, keeping count of derivatives separately might help. Of course, what is a “derivative” exactly? Do we count words that are morphologically analyzable but whose expected base word is absent? Should the answer depend on whether this is in dialect A (where it probably indicates the base word indeed being unknown) or in dialect C (where it might have just gone unrecorded)? What if the base word is unrecorded everywhere in the data? This may indicate that apparent morphological analyzability is illusory after all. (I am reminded of SKRK’s habit of listing numerous “derivatives from unknown bases” of which a number have later turned out to be simply loanwords in their entirety, e.g. kangas ‘easily passable type of forest’ ← Germanic *gangaz ‘way’.) Or what if the underived root can only be attested from relatives further off (which may indicate that this derivative had become fossilized already in the last common ancestor of the varieties under study)? What if such a fossilization event only happened in the last common ancestor of some of the varieties studied, while archaic varieties still explicitly maintain the derivational relationship?

Instead of getting bogged in the above word-by-word considerations, perhaps a less fine-grained approach will help, for once. Instead of fine-tuning the separation of the data into “derivative” and “underived” parts, we could simply make an initial pass and then tag each variety with a “percent apparent derivatives” variable. All the above types of errors could still lurk around in this sum total, but since they will not be tied down to single items, we can expect them to partially cancel each other out. Theoretical considerations could maybe also be used to approximate a general error term to apply.
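The per-variety tag could be computed along these lines. I assume the initial pass has already flagged each entry as an apparent derivative or not; how that flag is assigned is exactly the open question above and is not addressed here. The data and the function name `derivative_share` are invented.

```python
def derivative_share(entries):
    """Percentage of entries flagged as apparent derivatives.
    `entries` is a list of (word, is_apparent_derivative) pairs."""
    if not entries:
        return 0.0
    flagged = sum(1 for _, is_derived in entries if is_derived)
    return 100.0 * flagged / len(entries)

# Hypothetical output of an initial tagging pass for two varieties.
tagged = {
    "A": [("w1", False), ("w2", True), ("w3", True), ("w4", False),
          ("w5", True), ("w6", False), ("w7", True), ("w8", False)],
    "C": [("w1", False), ("w4", False), ("w6", False), ("w2", True)],
}

shares = {variety: derivative_share(entries)
          for variety, entries in tagged.items()}
print(shares)
```

A large gap between two varieties in this variable would then be a warning that their collections are not directly comparable, before any conclusions about affinity are drawn.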

Yet derivatives might not be the only subset of the lexicon to be worried about. E.g. proper names, ideophones and onomatopoeia could perhaps run into similar issues as well. Or they might not; I do not have any firm evidence to go on about this yet. In the end, a lot will depend on what the data has been collected for in the first place.
