Introducing the LnNor Corpus: A spoken multilingual corpus of non-native and native Norwegian, English and Polish
The CLIMAD and ADIM projects jointly set out to create a multilingual speech corpus for investigating cross-linguistic influence (CLI) during multilingual language use (L3/Ln) across various domains, language settings and stages of acquisition. The result of this effort is the LnNor speech corpus.
The corpus has been published under an open license in three repositories:
The LnNor corpus part 1 consists of 1073 annotated files from 78 speakers: 53 L1 speakers of Polish, 16 L1 speakers of Norwegian and 9 L1 speakers of other European languages. The total recording time is approximately 35 hours and the full size is 18 GB. The recordings in part 1 were collected between 2021 and 2022.
The LnNor corpus part 2 consists of 1671 annotated files from 164 speakers: 113 L1 speakers of Polish, 33 L1 speakers of Norwegian and 18 L1 speakers of English. The total recording time is approximately 59 hours and the full size is 26 GB. The recordings in part 2 were collected between 2023 and 2024.
All speech samples were recorded with Shure SM-35 unidirectional cardioid head-worn condenser microphones connected to portable Marantz PMD620 solid-state recorders, with the signal digitized at 48 kHz, 16-bit. This set-up was selected to minimize ambient noise and produce clear, focused recordings.
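As a quick sanity check when working with the audio, the stated recording format can be verified programmatically. Below is a minimal sketch using Python's standard wave module; the file name is hypothetical, and it assumes the recordings are distributed as WAV files.

import wave

def check_recording_specs(path, expected_rate=48000, expected_bits=16):
    """Report whether a WAV file matches the stated 48 kHz / 16-bit format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()           # sampling rate in Hz
        bits = wav.getsampwidth() * 8       # sample width in bits
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate  # duration in seconds
    ok = (rate == expected_rate) and (bits == expected_bits)
    print(f"{path}: {rate} Hz, {bits}-bit, {channels} ch, {duration:.1f} s "
          f"-> {'OK' if ok else 'MISMATCH'}")
    return ok

# Hypothetical file name; actual corpus naming conventions may differ.
check_recording_specs("LnNor_part1/S001_NO_L3_PD.wav")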
About the LnNor Corpus
The LnNor corpus was created to represent multilingual speech, with a focus on L3/Ln learners of Norwegian as well as native-speaker controls for Norwegian, English and Polish. The corpus was constructed to study linguistic variation in learners acquiring Norwegian as a foreign language in instructed and naturalistic settings. Additionally, a subcorpus of native speech is provided to serve as a benchmark against which the learners’ productions can be compared. Furthermore, the corpus contains word-level alignment of orthographic transcriptions to facilitate subsequent analyses across various linguistic domains.
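To illustrate how the word alignments can feed subsequent analyses, here is a minimal sketch that computes per-word durations. It assumes, purely for illustration, that alignments have been exported to a tab-separated file with one word per row (word, start time, end time in seconds); the file name and format are hypothetical, and the annotation files actually distributed with the corpus may use a different format (e.g. Praat TextGrids).

import csv

def word_durations(alignment_path):
    """Collect per-word durations from a tab-separated alignment file.

    Assumed (hypothetical) format: one row per word with columns
    word, start_sec, end_sec.
    """
    durations = {}
    with open(alignment_path, newline="", encoding="utf-8") as f:
        for word, start, end in csv.reader(f, delimiter="\t"):
            durations.setdefault(word.lower(), []).append(float(end) - float(start))
    return durations

# Example: mean duration of each word type in one recording (hypothetical path).
durs = word_durations("alignments/S001_NO_L3_TR1.tsv")
for word, values in sorted(durs.items()):
    print(f"{word}\t{sum(values) / len(values):.3f} s")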
A range of methodologies, including perception and production tests, grammaticality judgement tasks and online neurophysiological measures such as EEG, was used to investigate multilingual processing. By capturing real-time insights into the interplay of cross-linguistic influences, the projects not only contributed to the understanding of L3/Ln acquisition but also advanced theoretical frameworks in this field.
Corpus data collection covered a broad range of speech elicitation tasks. The recordings consist of word, sentence and text reading, picture story description, video story retelling, spontaneous speech and socio-phonetic interviews in Polish, English and Norwegian. The corpus contains metadata based on the Language History Questionnaire (Li et al. 2020), such as age, gender, native languages, proficiency level, length of language exposure and age of onset.
Data were collected from the following groups of speakers:
- L1 Polish learners of Norwegian as L3/Ln attending Scandinavian studies at Poznań College of Modern Languages and the University of Szczecin (instructed learners);
- L1 Polish learners of Norwegian as L3/Ln living in Norway (naturalistic learners);
- L1 English native speakers as controls;
- L1 Norwegian native speakers as controls;
- speakers of L2/L3/Ln English and L2/L3/Ln Norwegian with various L1 backgrounds.
Four types of speech tasks were recorded in Norwegian, English and Polish:
- word reading
- sentence reading
- text reading (“The North Wind and the Sun”)
- picture story telling
Metadata corresponding to the recordings include the following information (see the query sketch after this list):
- speaker ID, age, gender, education, city, region, country, speaker status (instructed/naturalistic/mixed), native language, additional languages spoken, proficiency levels, etc.
- recording ID
- language: PL (Polish), EN (English), NO (Norwegian)
- status: L1, L2, L3/Ln
- mode: TR1/2 (text reading), SR1/2 (sentence reading), WR (word reading), PD (picture description), ST (story telling), VT (video story telling)
- recording date, recording place, iteration, recording environment, recording device, type of microphone, noise level, etc.
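As an example of how these fields can be used to query the corpus, the following sketch selects picture descriptions produced in Norwegian as L3/Ln by instructed learners. It assumes the metadata are exported to a CSV file; the file name and exact column names are hypothetical and should be adapted to the released metadata tables.

import pandas as pd

# Hypothetical metadata export; column names mirror the fields listed above.
meta = pd.read_csv("LnNor_metadata.csv")

# Picture descriptions (PD) in Norwegian produced as L3/Ln by instructed learners.
subset = meta[
    (meta["language"] == "NO")
    & (meta["status"] == "L3/Ln")
    & (meta["mode"] == "PD")
    & (meta["speaker_status"] == "instructed")
]

print(f"{len(subset)} matching recordings")
print(subset[["recording_id", "speaker_id", "age", "gender"]].head())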