TeflonNorL2 NOCASA Challenge Dataset

Description

This is a specialized version of the dataset used for the Non-native Children’s Automatic Speech Assessment Challenge (NOCASA), https://teflon.aalto.fi/nocasa-2025/, hosted by the IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025, https://2025.ieeemlsp.org/en/.

The full dataset is described here:

Anne Marte Haug Olstad, Anna Smolander, Sofia Strömbergsson, Sari Ylinen, Minna Lehtonen, Mikko Kurimo, Yaroslav Getman, Tamás Grósz, Xinwei Cao, Torbjørn Svendsen, and Giampiero Salvi. 2024. Collecting Linguistic Resources for Assessing Children’s Pronunciation of Nordic Languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3529–3537, Torino, Italia. ELRA and ICCL.

The specialized version of the data and the challenge are described here:

Getman, Y., Grósz, T., Kurimo, M., & Salvi, G. (2025). "Non-native Children's Automatic Speech Assessment Challenge (NOCASA)". IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Istanbul, Turkey

Compared to the full dataset, the challenge data were modified as follows:

Some recordings were excluded.
The data were split into a training set and a test set, using a procedure intended to keep the distribution of speaker characteristics similar across the two sets.
The file names were anonymized to hide speaker identities (it should not be possible to infer which recordings belong to the same speaker).
Metadata were limited to the orthographic transcription and assessment score for the training data, and to the orthographic transcription only for the test data.

Here, we also release assessment scores for the test data separately.

Files:

train_audio.tgz: audio files for the training set
test_audio.tgz: audio files for the test set
train.csv.gz: metadata for the training data (orthographic transcriptions and assessment scores)
test.csv.gz: metadata for the test data (orthographic transcriptions)
test_full.csv.gz: metadata for the test data (orthographic transcriptions and assessment scores)
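
The metadata files can be inspected without unpacking them first. The following is a minimal sketch, not an official loader: it assumes the files are UTF-8, comma-separated, and start with a header row, none of which is stated on this page. Because the column names are not documented here, the sketch reads them from the header instead of hard-coding them. The function name `read_metadata` is hypothetical.

```python
import csv
import gzip


def read_metadata(path):
    """Return (column_names, rows) from a gzip-compressed CSV file.

    Assumptions (not confirmed by the dataset page): UTF-8 encoding,
    comma delimiter, and a header row. Each row is returned as a dict
    keyed by the column names found in the header.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
    return columns, rows
```

Calling `read_metadata("train.csv.gz")` on a downloaded copy and printing the returned column names is a quick way to find out what the transcription and score columns are actually called.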

Scroll down to download the files.

Contact Professor Giampiero Salvi (giampiero.salvi@ntnu.no) at NTNU with any questions about the dataset.

Distributions
1

Description:

Not specified

Access URL:

https://hdl.handle.net/21.11146/94

Direct download:

https://www.nb.no/sbfil/teflon/test.csv.gz
https://www.nb.no/sbfil/teflon/train.csv.gz
https://www.nb.no/sbfil/teflon/test_audio.tgz
https://www.nb.no/sbfil/teflon/train_audio.tgz
https://www.nb.no/sbfil/teflon/test_full.csv.gz
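
The direct-download links above can also be fetched from a script. The sketch below uses only the Python standard library and rests on a few assumptions: that the `.tgz` files are gzip-compressed tar archives (suggested by the file extension, but the internal directory layout is not documented here), and that the helper names and the `teflon_nocasa` target directory are hypothetical choices made for illustration.

```python
import tarfile
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# Direct-download URLs exactly as listed on this page.
DOWNLOAD_URLS = [
    "https://www.nb.no/sbfil/teflon/test.csv.gz",
    "https://www.nb.no/sbfil/teflon/train.csv.gz",
    "https://www.nb.no/sbfil/teflon/test_audio.tgz",
    "https://www.nb.no/sbfil/teflon/train_audio.tgz",
    "https://www.nb.no/sbfil/teflon/test_full.csv.gz",
]


def local_name(url):
    """Return the file name component of a download URL."""
    return PurePosixPath(urlparse(url).path).name


def download(url, dest_dir):
    """Download url into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir) / local_name(url)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_archive(archive, dest_dir):
    """Unpack a gzip-compressed tar archive; return the member names."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter where the running
        # Python version provides it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return names


def fetch_all(dest_dir="teflon_nocasa"):  # hypothetical directory name
    """Download every listed file and unpack the audio archives."""
    for url in DOWNLOAD_URLS:
        path = download(url, dest_dir)
        if path.suffix == ".tgz":
            extract_archive(path, Path(dest_dir) / path.stem)
```

Running `fetch_all()` would place the metadata files in `teflon_nocasa/` and unpack each audio archive into a subdirectory named after it. The audio archives may be large, so the existence check in `download` avoids fetching a file twice.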

API:

Not specified

Documentation:

Not specified

License:

Not specified

Conforms to:

Not specified

APIs that provide this dataset
0

No registered APIs provide this dataset.

Similar datasets

Norsk ordbank - nynorsk 2005-2012	Nasjonalbiblioteket	Public access
ONOMASTICA uttaleleksikon 2	Nasjonalbiblioteket	Public access
Omsetjingsminne frå Semantix AS	Nasjonalbiblioteket	Public access
NST uttaleleksikon for svensk	Nasjonalbiblioteket	Public access
Grafem-til-fonem-modeller for norsk	Nasjonalbiblioteket	Public access


Unnamed distribution: application/x-tgz, application/x-gzip
