L1 Personalized Lexical Complexity Prediction (LCP) Dataset for German-Language Learners
Psycholinguistics | Second Language Acquisition (SLA) | Ontology | Semantic Web Technologies | Statistical Modelling
This project conducted an annotation task to create an LCP dataset for German-language learners. It aims to assess how word difficulty differs between learners whose language backgrounds share roots and commonalities with the target language and learners whose backgrounds do not. The dataset can be used to train a contextualized LCP model with personalized features.
Description of data
Unstructured continuous text from various domains was selected, with specific target words chosen from German texts of different levels. The dataset collects features that can be analyzed for correlation with the complexity of the target words. For this purpose, various learner-specific features (which depend on the learner performing the annotation) and language-specific features (which depend on the target language) were recorded. All learner-specific and language-specific features are listed below.
Learner-specific features
Feature | Description |
---|---|
speaker_id | Individual speaker id |
german_level | The participant's current German level on the CEFR scale |
seen_before | Previous exposure to the word |
country_of_origin | Country the speaker comes from |
native_language | Native language of speaker |
background | Professional background of the participant |
years_in_germany | Number of years the participant has lived in Germany |
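The learner-specific features above could be held in a small record type such as the following. This is only a sketch: the class name `LearnerFeatures`, the field types, and the validation rules are assumptions for illustration, since the table does not state how each value is encoded in the dataset.

```python
from dataclasses import dataclass

# CEFR proficiency levels, lowest to highest.
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


@dataclass(frozen=True)
class LearnerFeatures:
    """Learner-specific features of one annotation (field types are assumed)."""

    speaker_id: str
    german_level: str        # CEFR level, e.g. "B1"
    seen_before: bool        # assumed binary; the dataset may encode it differently
    country_of_origin: str
    native_language: str
    background: str
    years_in_germany: float

    def __post_init__(self):
        # Reject values that cannot be valid under the assumed encoding.
        if self.german_level not in CEFR_LEVELS:
            raise ValueError(f"unknown CEFR level: {self.german_level!r}")
        if self.years_in_germany < 0:
            raise ValueError("years_in_germany must be non-negative")

    def cefr_rank(self) -> int:
        """Ordinal encoding of the CEFR level (A1 -> 0, ..., C2 -> 5)."""
        return CEFR_LEVELS.index(self.german_level)
```

An ordinal encoding like `cefr_rank()` is one simple way to feed the CEFR level into a statistical model; other encodings (for example one-hot) may suit a given model better.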
Language-specific features
Feature | Description |
---|---|
domain | Dataset from which the text is drawn |
msy_category | Morphosyntactic category of the target word |
en_cognate | Cognate value indicating whether the target word is an English cognate |
es_cognate | Whether the target word has a Spanish cognate |
es_ff | Whether the target word has a false friend in Spanish |
semantic_sim | Semantic similarity of the target word with the other words in the corpus |
length | Length of the presented text in number of words |
dependency_depth | Length of the syntactic dependency relation of the presented target word |
word_frequency | Frequency of the target word, determined with wordfreq |
child_lex_frequency | Binary indicator of whether the target word exists in the ChildLex frequency dataset (Schroeder et al., 2015) |
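Two of the simpler language-specific features, `length` and `child_lex_frequency`, can be sketched with the standard library alone. The helper names, the whitespace tokenization, the case-insensitive lookup, and the tiny word set standing in for ChildLex are all assumptions for illustration; the project's actual preprocessing is not described here.

```python
def text_length(text: str) -> int:
    """Number of words in the presented text (assumes whitespace tokenization)."""
    return len(text.split())


def child_lex_indicator(target_word: str, childlex_words: set) -> int:
    """Return 1 if the target word occurs in the given word set, else 0.

    `childlex_words` is assumed to hold lowercased word forms.
    """
    return int(target_word.lower() in childlex_words)


# Hypothetical stand-in for the ChildLex vocabulary (not real data).
toy_childlex = {"hund", "garten", "schnell"}

sentence = "Der Hund läuft schnell durch den Garten."
features = {
    "length": text_length(sentence),
    "child_lex_frequency": child_lex_indicator("Hund", toy_childlex),
}
```

Punctuation stays attached to tokens under whitespace splitting, which does not affect the word count but would matter if the tokens themselves were looked up.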
Generating word difficulty scores
To generate a complexity score for each target word and each participant, the GermaNet ontology was used to measure the semantic distance and similarity between the perceived meanings of the target words.
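To illustrate what a semantic distance over an ontology can look like, the sketch below computes a shortest-path distance and a path-based similarity over a toy hypernym graph. Everything in it is an assumption: the `HYPERNYMS` graph is a hand-made stand-in rather than GermaNet data, the function names are hypothetical, and the `1 / (1 + distance)` formula is one common path-based measure, not necessarily the one used in this project.

```python
from collections import deque

# Toy hypernym links (child -> parent); a hypothetical stand-in for GermaNet.
HYPERNYMS = {
    "Hund": "Säugetier",
    "Katze": "Säugetier",
    "Säugetier": "Tier",
    "Vogel": "Tier",
    "Tier": "Lebewesen",
    "Baum": "Pflanze",
    "Pflanze": "Lebewesen",
}


def build_graph(hypernyms):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_distance(graph, source, target):
    """Shortest number of edges between two concepts, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Path-based similarity in (0, 1]; identical concepts score 1.0."""
    dist = path_distance(graph, a, b)
    return None if dist is None else 1.0 / (1.0 + dist)


graph = build_graph(HYPERNYMS)
```

Under this toy graph, a learner who reads "Hund" as "Katze" would be two edges from the intended meaning, while one who reads it as "Baum" would be five edges away, so the second reading would receive a lower similarity.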
References
Sascha Schroeder, Kay-Michael Würzner, Julian Heister, Alexander Geyken, and Reinhold Kliegl. 2015. childLex – Eine lexikalische Datenbank zur Schriftsprache für Kinder im Deutschen. Psychologische Rundschau, 66:155–165.