This page last updated 2018-11-25
I publish various data via this blog. In the foreseeable future, most of this will probably be indices to or digitized versions of existing etymological and lexical publications.
From the dropdown menu in the top bar you can currently find two datasets hosted on-site. Others are instead described in blog posts: these are either more structured files hosted off-site, or datasets small enough to be included within a blog post in their entirety:
- An index of Lexikon der indogermanischen Verben
- A topical index of the publication series Suomalais-Ugrilaisen Seuran Toimituksia (up to 2015 / vol. 272)
- A list of real and apparent word doublets in Finnish
- The known corpus of Yurats (a single wordlist)
There are also numerous datasets I still have under preparation. I do not have the space or time to cover their structure, goals, sources etc. in detail here, but reasonably complete ones include e.g.:
- an index of the oldest Germanic loans in Finnic, as identifiable by phonological criteria, building on Aikio & Aikio (2001)
- the Proto-Permic comparative lexicon, as given by Csúcs (2005)
- the Proto-Ob-Ugric comparative lexicon, as given by Honti (1982)
- an index of Iranian loanwords in Ob-Ugric, as given by Korenchy (1972)
- an index of Proto-Samoyedic vocabulary by Janhunen (1977)
(presented earlier at the 7th Int’l Conference on Samoyed Studies)
- an index of the Proto-Uralic comparative lexicon of Collinder (1955)
(the Finno-Ugric portion remains very much WIP)
- the Proto-Uralic/Finno-Ugric/Finno-Permic comparative lexicon of Sammallahti (1988)
(building on an earlier text-file digitization by T. Salminen)
- a Ural-Altaic comparative lexicon by Räsänen (1955)
“Indexes” generally include only the reconstructed proto-forms, and for daughter languages only note whether a reflex is given at all by the source(s). “Lexicons” also include the actual descendant wordforms. For most of these I still plan on adding some details from other sources, but if anyone is interested in a copy for research purposes, feel free to get in touch e.g. by email. These are generally OpenOffice spreadsheets; a few are plain text.
There are moreover several projects I’ve started, but which after some initial work appear fairly daunting to finish; these are mainly distributional indices of large comparative or dialectological dictionaries. I plan on doing some scouting to find out whether anyone else is working on them as well, or would be interested in collaboration (of course, feel free to get in touch with me proactively if you happen to be such a person reading this).
- A distributional and morphological index of Karjalan kielen sanakirja (A–AI done)
- A distributional and morphological index of Vadja keele sõnaraamat (barely started)
- A distributional index of Dialektologisches und etymologisches Wörterbuch der ostjakischen Sprache (vowel-initial non-derived words done)
- A distributional index of Dravidian Etymological Dictionary (A and about half of Ā done)