Nasjonalbiblioteket

TeflonNorL2 NOCASA Challenge Dataset

Description

This is a specialized version of the dataset used for the Non-native Children’s Automatic Speech Assessment Challenge (NOCASA, https://teflon.aalto.fi/nocasa-2025/), hosted by the IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025 (https://2025.ieeemlsp.org/en/).

The full dataset is described here:

Anne Marte Haug Olstad, Anna Smolander, Sofia Strömbergsson, Sari Ylinen, Minna Lehtonen, Mikko Kurimo, Yaroslav Getman, Tamás Grósz, Xinwei Cao, Torbjørn Svendsen, and Giampiero Salvi. 2024. Collecting Linguistic Resources for Assessing Children’s Pronunciation of Nordic Languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3529–3537, Torino, Italia. ELRA and ICCL.

The specialized version of the data and the challenge are described here:

Getman, Y., Grósz, T., Kurimo, M., & Salvi, G. (2025). "Non-native Children's Automatic Speech Assessment Challenge (NOCASA)". IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Istanbul, Turkey.

Compared to the full dataset, a number of modifications have been made to the challenge data:

  • some recordings were excluded
  • the data was split into training and test sets following a procedure intended to keep the distribution of speaker characteristics similar across the two sets
  • the file names were anonymized to hide the speaker identities (it should not be possible to infer which recordings correspond to the same speaker)
  • metadata was limited to orthographic transcription and assessment score for the training data and only orthographic transcription for the test data

Here, we also release assessment scores for the test data separately.
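As a rough illustration of how the separately released test scores might be used, the sketch below compares a system's predicted scores with reference scores row by row. It is only a minimal sketch: the exact-match accuracy shown here is a generic measure chosen for illustration and is not claimed to be the challenge's official metric. The function name `accuracy` is hypothetical, and the column names are not documented on this page, so the caller must supply them after inspecting the files.

```python
def accuracy(reference_rows, predicted_rows, key_column, score_column):
    """Fraction of reference items whose predicted score equals the reference score.

    Both arguments are lists of dicts (one dict per recording). The column
    names are passed in by the caller because this page does not document them.
    Scores are compared as-is, so "3" and 3 would count as different values.
    """
    if not reference_rows:
        raise ValueError("reference_rows is empty")

    # Index predictions by the identifying column (for example, a file name).
    predicted = {row[key_column]: row[score_column] for row in predicted_rows}

    missing = [row[key_column] for row in reference_rows
               if row[key_column] not in predicted]
    if missing:
        raise KeyError(f"{len(missing)} reference items have no prediction")

    hits = sum(predicted[row[key_column]] == row[score_column]
               for row in reference_rows)
    return hits / len(reference_rows)
```

The reference rows would come from the test metadata that includes assessment scores, and the predicted rows from whatever system is being evaluated.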

Files:

  • train_audio.tgz: audio files for the training set
  • test_audio.tgz: audio files for the test set
  • train.csv.gz: metadata for the training data (orthographic transcriptions and assessment scores)
  • test.csv.gz: metadata for the test data (orthographic transcriptions)
  • test_full.csv.gz: metadata for the test data (orthographic transcriptions and assessment scores)
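The files listed above can be inspected with the Python standard library alone. The following is a minimal sketch, not an official loader: it assumes the CSV files are UTF-8 encoded, comma-delimited, and start with a header row (none of which is stated on this page), and it reads column names from the file itself rather than assuming them. The helper names `read_metadata` and `list_audio_members` are hypothetical.

```python
import csv
import gzip
import tarfile


def read_metadata(csv_gz_path):
    """Read a gzip-compressed CSV into a list of dicts keyed by its header row.

    Assumes UTF-8 text, comma delimiters, and a header line; adjust the
    arguments below if the actual files differ.
    """
    with gzip.open(csv_gz_path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


def list_audio_members(tgz_path):
    """Return the names of the regular files inside a gzip-compressed tar archive."""
    with tarfile.open(tgz_path, mode="r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]
```

For example, `read_metadata("train.csv.gz")` would return one dict per row, and inspecting the keys of the first row reveals the actual column names before any further processing.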

Scroll down to download the files.

Contact Professor Giampiero Salvi (giampiero.salvi@ntnu.no) at NTNU with any questions about the dataset.

Distributions

Description:
Not specified
Access URL:
https://hdl.handle.net/21.11146/94
Direct download:
  1. https://www.nb.no/sbfil/teflon/test.csv.gz
  2. https://www.nb.no/sbfil/teflon/train.csv.gz
  3. https://www.nb.no/sbfil/teflon/test_audio.tgz
  4. https://www.nb.no/sbfil/teflon/train_audio.tgz
  5. https://www.nb.no/sbfil/teflon/test_full.csv.gz
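The direct-download links above can also be fetched from a script. The following is a minimal sketch using only the Python standard library; the helper names `local_name` and `download` and the `.part` temporary suffix are illustrative choices, not part of any published tooling for this dataset. It does not verify checksums, since none are listed on this page.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Base URL and file names as they appear in the direct-download list.
BASE_URL = "https://www.nb.no/sbfil/teflon/"
FILES = [
    "test.csv.gz",
    "train.csv.gz",
    "test_audio.tgz",
    "train_audio.tgz",
    "test_full.csv.gz",
]


def local_name(url):
    """Derive a local file name from the last path component of a URL."""
    return Path(urlparse(url).path).name


def download(url, dest_dir="."):
    """Stream a URL to dest_dir and return the local path.

    An existing file is left untouched. Data is written to a temporary
    ".part" file first so an interrupted transfer never looks complete.
    """
    dest = Path(dest_dir) / local_name(url)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest
```

Calling `download(BASE_URL + name, dest_dir="nocasa_data")` for each `name` in `FILES` would fetch all five files into a local `nocasa_data` directory (a hypothetical directory name); the audio archives may be large, so the metadata files are a sensible first test.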
API:
Not specified
Documentation:
Not specified
License:
Not specified
Conforms to:
Not specified

APIs that provide this dataset

No registered APIs provide this dataset.
