This is a specialized version of the dataset used for the Non-native Children’s Automatic Speech Assessment Challenge (NOCASA), https://teflon.aalto.fi/nocasa-2025/, hosted by the IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025, https://2025.ieeemlsp.org/en/
The full dataset is described here:
Anne Marte Haug Olstad, Anna Smolander, Sofia Strömbergsson, Sari Ylinen, Minna Lehtonen, Mikko Kurimo, Yaroslav Getman, Tamás Grósz, Xinwei Cao, Torbjørn Svendsen, and Giampiero Salvi. 2024. Collecting Linguistic Resources for Assessing Children’s Pronunciation of Nordic Languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3529–3537, Torino, Italia. ELRA and ICCL.
The specialized version of the data and the challenge are described here:
Getman, Y., Grósz, T., Kurimo, M., & Salvi, G. (2025). "Non-native Children's Automatic Speech Assessment Challenge (NOCASA)". IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Istanbul, Turkey
Compared to the full dataset, the following modifications have been made to the challenge data:
- some recordings were excluded
- the data was split into training and test sets following a procedure intended to keep the distribution of speaker characteristics similar across the two sets
- the file names were anonymized to hide the speaker identities (it should not be possible to infer which recordings correspond to the same speaker)
- metadata was limited to orthographic transcription and assessment score for the training data and only orthographic transcription for the test data
Here, we also release the assessment scores for the test data separately (see test_full.csv.gz).
Files:
- train_audio.tgz: audio files for the training set
- test_audio.tgz: audio files for the test set
- train.csv.gz: metadata for the training data (orthographic transcriptions and assessment scores)
- test.csv.gz: metadata for the test data (orthographic transcriptions)
- test_full.csv.gz: metadata for the test data (orthographic transcriptions and assessment scores)
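
The files listed above could be loaded along the following lines. This is a minimal sketch using only the Python standard library, not an official loader for the challenge. It assumes that the CSV files are UTF-8 encoded and start with a header row, neither of which is stated in this description. The function names read_metadata and extract_audio are hypothetical helpers introduced for illustration, and no column names are assumed.

```python
import csv
import gzip
import tarfile
from pathlib import Path


def read_metadata(path):
    """Read a gzip-compressed CSV file into a list of dicts, one per row.

    Assumption: the first row is a header and the file is UTF-8 encoded.
    Column names are taken from that header; none are hard-coded here.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def extract_audio(archive_path, target_dir):
    """Unpack a .tgz archive into target_dir and return the extracted files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that reject
            # unsafe members such as absolute paths.
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    # Return regular files only, sorted for a stable order.
    return sorted(p for p in target.rglob("*") if p.is_file())
```

For example, read_metadata("train.csv.gz") would return one dict per recording, keyed by whatever column names the header row provides, and extract_audio("train_audio.tgz", "train_audio") would unpack the training audio into a local directory. Inspecting the keys of the first row is a reasonable way to find out how the transcription and score columns are actually named.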