Nasjonalbiblioteket

Norwegian Newspaper Corpus

Description

The Norwegian Newspaper Corpus was a project at the University of Bergen that crawled news websites for articles.

This version of the Norwegian Newspaper Corpus consists of text from 1998 to 2019. The corpus contains approximately 1.68 billion words of Norwegian Bokmål and about 68 million words of Norwegian Nynorsk.

There is also a simplified version of the corpus available (1998-2011), where duplicate sentences have been removed and the sentences are ordered alphabetically.

The texts from 1998-2011 are collected in a single downloadable file; for the other years, the data are structured as one file per year. See the documentation files for a description of the content and file formats.

Distributions
1

Download
Description:
Not provided
Access URL:
https://hdl.handle.net/21.11146/4
Direct download:
  1. https://www.nb.no/sbfil/tekst/nak_2014.tar
  2. https://www.nb.no/sbfil/tekst/nak_2017.tar
  3. https://www.nb.no/sbfil/tekst/nak_2016.tar
  4. https://www.nb.no/sbfil/tekst/nak_2012.tar
  5. https://www.nb.no/sbfil/tekst/nak_2019.tar
  6. https://www.nb.no/sbfil/tekst/nak_2015.tar
  7. https://www.nb.no/sbfil/tekst/nak_2013.tar
  8. https://www.nb.no/sbfil/tekst/nak_2018.tar
API:
Not provided
Documentation:
Not provided
License:
Conforms to:
Not provided
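
The per-year archives listed under "Direct download" share a common URL pattern, so they can be fetched programmatically. The following is a minimal Python sketch, not an official client: the URL template is inferred from the eight links above, and the helper names (archive_url, download_archive, list_members) are hypothetical. The file name of the single 1998-2011 archive is not given on this page, so it is deliberately left out, and the internal layout of each archive is not assumed.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumption: URL template inferred from the direct-download links on this page.
BASE_URL = "https://www.nb.no/sbfil/tekst/nak_{year}.tar"
# Years for which a per-year archive is listed (2012 through 2019).
YEARS = range(2012, 2020)


def archive_url(year: int) -> str:
    """Return the direct-download URL for one per-year archive."""
    if year not in YEARS:
        raise ValueError(f"no per-year archive listed for {year}")
    return BASE_URL.format(year=year)


def download_archive(year: int, dest_dir: Path) -> Path:
    """Download one archive into dest_dir and return the local file path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / f"nak_{year}.tar"
    urllib.request.urlretrieve(archive_url(year), target)
    return target


def list_members(archive: Path) -> list[str]:
    """List the member names of a tar archive without extracting it."""
    with tarfile.open(archive) as tar:
        return tar.getnames()
```

For example, list_members(download_archive(2014, Path("nak"))) would download the 2014 archive and show what it contains. The archive sizes are not stated here, so it may be worth inspecting one year before fetching all eight.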

APIs providing this dataset
0

No registered APIs provide this dataset.

Similar datasets

NST Pronunciation Lexicon for Swedish (Nasjonalbiblioteket)
Public access
Grapheme-to-Phoneme Models for Norwegian (Nasjonalbiblioteket)
Public access
SCARRIE Lexicon (Nasjonalbiblioteket)
Public access
ONOMASTICA Pronunciation Lexicon (Nasjonalbiblioteket)
Public access
N-grams from NBdigital 2021 (Nasjonalbiblioteket)
Public access