Orthographic differences among Germanic languages: Stem variation versus inflectional affix variation by Wilbert Heeringa, Femke Swarte, Anja Schüppert and Charlotte Gooskens

Sometimes readers are confronted with texts written in a language which is unknown to them. If the linguistic differences are too large, for example because many words are non-cognates, i.e. have different linguistic origins, or when a different writing system is used, the meaning of the texts will be difficult to understand. However, when the language is closely related, which will be reflected in a large number of cognates, and when it is written in a familiar writing system, a certain degree of intelligibility may be attained.

Apart from the number of non-cognates and syntactic differences, the intelligibility of the text depends on the extent to which the written form of the cognate words in the text differs from that of the language of the reader. In our study we distinguish between differences in the stem of the word and differences in its inflectional affixes. The stem can be a root, a compound or a derivationally complex form. When comparing the stem of a word in one language with the stem of the corresponding word in another language, we consider both measurements in which the words are required to be cognates and measurements in which non-cognate comparisons are also allowed. The inflectional affixes can be cognates (e.g. plural -s in Dutch studies and English studies) or non-cognates (e.g. Dutch regels 'rules' versus German Regeln). Both components of the word may differ across the two languages as a result of differences in spelling conventions (e.g. Dutch uu and German ü have the same pronunciation but are written differently) and differences in pronunciation (e.g. Dutch helpen 'to help' versus German helfen). We hypothesize that orthographic differences in the stem, which is generally considered to carry a high information load, will have a larger effect on intelligibility than differences in the inflectional affixes.

This study takes place in the context of a larger research programme which examines the communication possibilities between speakers of closely related languages within the Germanic, Romance and Slavic language groups. In the future we will correlate our orthographic distances (and their two components, namely stem-based distance and affix-based distance) with scores obtained in a large-scale web-based study of the intelligibility of written language. In the present contribution we focus on the following research questions:

  1. Do aggregated orthographic stem distances between languages correlate with aggregated distances obtained on the basis of differences between inflectional affixes?
  2. Do individual orthographic stem distances of the word pairs of a language pair correlate with the orthographic distances of the corresponding inflectional affixes?
  3. Are distances between languages measured on the basis of orthographic differences in word stems relatively larger or smaller than distances between languages measured on the basis of orthographic differences in inflectional affixes?

Our study is based on five Germanic languages, namely Danish, Dutch, English, German and Swedish. We looked at the pairwise orthographic distances between nouns, verbs and adjectives in four English texts and their translations into the other four languages. Each word is split into its stem and its inflectional affixes (if present). The orthographic differences between the word components are measured by means of Levenshtein distance, which finds the minimum sum of the weights of the operations required to change one word (of one language) into another word (of another language) by inserting, deleting or substituting characters.
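
To make the measurement concrete, the following is a minimal sketch of a weighted Levenshtein distance applied separately to stems and inflectional affixes. It is an illustration only and not the implementation used in the study: the unit operation weights, the function names `levenshtein` and `component_distances`, the lowercasing of the words and the hand-made segmentation of the example words into stem and affix are all assumptions introduced here.

```python
def levenshtein(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Minimum total weight of insertions, deletions and substitutions
    needed to turn `source` into `target` (unit weights are an assumption)."""
    # prev[j] holds the distance between the empty prefix of source
    # and the first j characters of target.
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        # Turning the first i characters of source into "" takes i deletions.
        curr = [i * del_cost]
        for j, t_char in enumerate(target, start=1):
            match_cost = 0.0 if s_char == t_char else sub_cost
            curr.append(min(
                prev[j] + del_cost,         # delete s_char
                curr[j - 1] + ins_cost,     # insert t_char
                prev[j - 1] + match_cost,   # keep or substitute
            ))
        prev = curr
    return prev[-1]


def component_distances(word_a, word_b):
    """Return (stem distance, affix distance) for two words, each given
    as a hypothetical (stem, affix) pair segmented by hand."""
    stem_a, affix_a = word_a
    stem_b, affix_b = word_b
    return levenshtein(stem_a, stem_b), levenshtein(affix_a, affix_b)


# Dutch helpen versus German helfen: the difference lies in the stem.
print(component_distances(("help", "en"), ("helf", "en")))   # (1.0, 0.0)
# Dutch regels versus German Regeln (lowercased): the difference lies in the affix.
print(component_distances(("regel", "s"), ("regel", "n")))   # (0.0, 1.0)
```

Under these assumptions the two word pairs have the same whole-word distance, but the split into stem and affix shows where the difference is located, which is the distinction the research questions rely on.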