L1 Personalized Lexical Complexity Prediction (LCP) Dataset for German-Language Learners

Psycholinguistics | Second Language Acquisition (SLA) | Ontology | Semantic Web Technologies | Statistical Modelling

This project conducted an annotation task to create an LCP dataset for learners of German. It aims to assess how word difficulty varies between learners whose language background shares roots and commonalities with the target language and learners whose background does not. The dataset can be used to train a contextualized LCP model with personalized features.

Description of data

Unstructured continuous text from various domains was selected, and specific target words were chosen from German texts of different levels. The dataset collects features that can be analyzed for correlation with the complexity of the target words. To this end, both learner-specific features (which depend on the learner performing the annotation) and language-specific features (which depend on the target language) were recorded. The complete list of learner-specific and language-specific features is given below.

Learner-specific features

| Feature | Description |
| --- | --- |
| speaker_id | Individual speaker ID |
| german_level | Current German level of the participant on the CEFR scale |
| seen_before | Previous exposure to the word |
| country_of_origin | Country the speaker comes from |
| native_language | Native language of the speaker |
| background | Professional background of the participant |
| years_in_germany | Number of years the participant has lived in Germany |
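
As a rough illustration of how the learner-specific features could be held in code, the sketch below groups them into one record per annotator. The class name `LearnerProfile`, the field types, and the validation rules are assumptions made for this example and are not part of the dataset specification. `seen_before` is left out here because it depends on the individual target word rather than on the learner alone.

```python
from dataclasses import dataclass

# The six CEFR levels; assumed to be the values stored in german_level.
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


@dataclass(frozen=True)
class LearnerProfile:
    """Learner-specific features for one annotator (field types are assumptions)."""

    speaker_id: str
    german_level: str
    country_of_origin: str
    native_language: str
    background: str
    years_in_germany: float

    def __post_init__(self):
        # Reject values that cannot be valid under the assumed encoding.
        if self.german_level not in CEFR_LEVELS:
            raise ValueError(f"unknown CEFR level: {self.german_level!r}")
        if self.years_in_germany < 0:
            raise ValueError("years_in_germany must be non-negative")


# Hypothetical annotator, for illustration only.
profile = LearnerProfile("s01", "B1", "Spain", "Spanish", "engineering", 2.5)
```

Validating at construction time keeps malformed rows out of later analysis, at the cost of failing early on data that uses a different encoding.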

Language-specific features

| Feature | Description |
| --- | --- |
| domain | Dataset from which the text was taken |
| msy_category | Morphosyntactic category of the target word |
| en_cognate | Cognate value indicating whether the target word is an English cognate |
| es_cognate | Whether the target word has a Spanish cognate |
| es_ff | Whether the target word has a false friend in Spanish |
| semantic_sim | Semantic similarity of the target word to the other words in the corpus |
| length | Length of the presented text in number of words |
| dependency_depth | Syntactic dependency relation length of the presented target word |
| word_frequency | Frequency of the target word, determined with wordfreq |
| child_lex_frequency | Binary value indicating whether the target word occurs in the childLex frequency dataset (Schroeder et al., 2015) [1] |
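
The features listed above are meant to be analyzed for correlation with word complexity. A minimal sketch of such an analysis is a Pearson correlation between one numeric feature and the complexity ratings. The function below is a generic implementation using only the standard library. The numbers are made-up placeholders and are not taken from the dataset, and the choice of Pearson over a rank-based measure is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder values, NOT taken from the dataset.
word_frequency = [5.1, 4.2, 3.0, 2.4]  # hypothetical frequency scores
complexity = [1.0, 2.0, 4.0, 5.0]      # hypothetical complexity ratings

r = pearson(word_frequency, complexity)
print(f"r = {r:.2f}")
```

If the complexity ratings are ordinal, a rank correlation such as Spearman's may be the more appropriate choice.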

Generating word difficulty scores

To generate complexity scores for the target words for each participant, the GermaNet ontology was used to identify semantic distances and the similarity of the perceived meanings of the target words.
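
To make the idea of semantic distance in an ontology concrete, the sketch below computes a simple path-based distance and similarity over a tiny hand-written is-a hierarchy. The hierarchy, the function names, and the similarity formula are illustrative assumptions. The example does not use GermaNet data and does not claim to reproduce the scoring procedure used in this project.

```python
# Toy hypernym (is-a) links: child -> parent.
# Hypothetical hierarchy for illustration; this is NOT GermaNet data.
HYPERNYM = {
    "Hund": "Säugetier",
    "Katze": "Säugetier",
    "Säugetier": "Tier",
    "Vogel": "Tier",
    "Tier": "Lebewesen",
    "Baum": "Pflanze",
    "Pflanze": "Lebewesen",
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def path_distance(a, b):
    """Number of is-a edges between a and b via their lowest common ancestor."""
    steps_from_b = {n: i for i, n in enumerate(path_to_root(b))}
    for steps_from_a, node in enumerate(path_to_root(a)):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError(f"{a!r} and {b!r} share no ancestor")


def path_similarity(a, b):
    """Map the distance into (0, 1]; identical nodes score 1.0."""
    return 1.0 / (1.0 + path_distance(a, b))


print(path_distance("Hund", "Katze"))  # 2 edges: Hund -> Säugetier <- Katze
print(path_distance("Hund", "Baum"))   # 5 edges, meeting at Lebewesen
```

In this toy hierarchy, "Hund" and "Katze" come out as more similar than "Hund" and "Baum" because their common ancestor lies fewer edges away.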

References

1. Sascha Schroeder, Kay-Michael Würzner, Julian Heister, Alexander Geyken, and Reinhold Kliegl. 2015. childLex: Eine lexikalische Datenbank zur Schriftsprache für Kinder im Deutschen. Psychologische Rundschau, 66:155–165.