Nasjonalbiblioteket

Stortinget Speech Corpus version 1.0

  • Datasets
  • Public access 

    Publicly available to everyone. Access may still require registration or an API key, provided that anyone can request them.

  • Open data 

    The dataset is classified as public access and has at least one distribution with an approved open license.

Description

The Stortinget Speech Corpus (SSC) is a speech dataset of more than 5000 hours for weakly supervised ASR, built from audio and aligned proceedings text from Stortinget, the Norwegian Parliament. It contains speech segments of up to 30 seconds, with transcriptions in Norwegian Bokmål (nob) and Norwegian Nynorsk (nno) taken from the official proceedings.

The dataset is distributed as a JSONL file. Audio files, proceedings files and transcription files (with ASR output) are included in this repository and are referenced from the JSONL file by relative file paths. Note that only the segmented audio files are part of the release.
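
Reading the JSONL file and resolving its relative paths could look roughly like the sketch below. The key name `audio_path` and the helper `iter_segments` are hypothetical, and the sketch assumes that paths are relative to the directory containing the JSONL file; the documentation files are the authority on the actual field names and layout.

```python
import json
from pathlib import Path
from typing import Iterator, Sequence


def iter_segments(jsonl_path: Path,
                  path_keys: Sequence[str] = ("audio_path",)) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line.

    Values stored under `path_keys` (hypothetical key names) are joined
    onto the directory of the JSONL file. That base directory is an
    assumption, not something the dataset description states.
    """
    base = jsonl_path.parent
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            for key in path_keys:
                if key in record:
                    record[key] = str(base / record[key])
            yield record
```

Because the function is a generator, the full file never has to be held in memory, which matters for a file with several hundred thousand segments.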

Dataset statistics

  • Number of segments: 724 783
  • Total duration in hours: 5 190
  • Number of unique speakers: 729

For more detailed information, see the documentation files.
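
As a rough sanity check, the three figures above could be recomputed from the JSONL file with a sketch like the following. The key names `duration` and `speaker_id`, the assumption that durations are stored in seconds, and the helper `summarize` are all hypothetical and should be checked against the documentation files.

```python
import json
from pathlib import Path


def summarize(jsonl_path,
              duration_key: str = "duration",
              speaker_key: str = "speaker_id") -> dict:
    """Count segments, sum durations and count distinct speakers.

    Both key names are hypothetical, and durations are assumed to be
    given in seconds.
    """
    segments = 0
    total_seconds = 0.0
    speakers = set()
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            segments += 1
            total_seconds += float(record.get(duration_key, 0.0))
            if speaker_key in record:
                speakers.add(record[speaker_key])
    return {
        "segments": segments,
        "hours": total_seconds / 3600.0,
        "speakers": len(speakers),
    }
```

The published figures imply a mean segment length of roughly 25.8 seconds (5 190 hours × 3 600 / 724 783 segments), which is consistent with the stated maximum of 30 seconds.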

Distributions (1)

  • Nameless distribution (format: gtar)

APIs providing this dataset (0)

No registered APIs provide this dataset.

Similar datasets

  • ONOMASTICA Pronunciation Lexicon (Nasjonalbiblioteket, public access)
  • SNOMED CT – English Terms Translated to Norwegian Bokmål and Norwegian Nynorsk (Nasjonalbiblioteket, public access)
  • N-grams from NBdigital (Nasjonalbiblioteket, public access)
  • Texts from Norwegian Wikipedia (Nasjonalbiblioteket, public access)
  • Translation memories from Amesto Translations AS (Nasjonalbiblioteket, public access)