Nasjonalbiblioteket

Norwegian Newspaper Corpus

Description

The Norwegian Newspaper Corpus was a project at the University of Bergen where news websites were crawled for news articles.

This version of the Norwegian Newspaper Corpus consists of text from 1998 to 2019. The corpus contains approximately 1.68 billion words of Norwegian Bokmål and about 68 million words of Norwegian Nynorsk.

There is also a simplified version of the corpus available (1998-2011), where duplicate sentences have been removed and the sentences are ordered alphabetically.
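As a rough illustration of what "duplicate sentences removed and ordered alphabetically" can mean in practice, here is a minimal Python sketch. It is not the procedure used to build the corpus, which is described only in the documentation files. The function name `simplify` and the sample sentences are hypothetical, and the sketch assumes exact string matching for duplicates and plain Unicode code point order for sorting.

```python
def simplify(sentences):
    """Return the unique, non-empty sentences in sorted order.

    Hypothetical helper for illustration only; the corpus itself may
    define duplicates and alphabetical order differently.
    """
    # Strip surrounding whitespace so that otherwise identical
    # sentences compare equal.
    cleaned = (s.strip() for s in sentences)
    # A set keeps one copy of each distinct sentence; empty lines are dropped.
    unique = {s for s in cleaned if s}
    # sorted() compares by Unicode code point. This only approximates
    # Norwegian alphabetical order (which ends in æ, ø, å).
    return sorted(unique)


sample = [
    "Det regner i Bergen.",
    "Avisen kom ut i går.",
    "Det regner i Bergen.",
    "",
]
result = simplify(sample)
print(result)
```

One consequence of this kind of processing is that the original order of sentences, and therefore the context of each article, is no longer recoverable from the simplified version.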

The texts from 1998-2011 are collected in a single downloadable file; the remaining data are structured as one file per year. See the documentation files for a description of the content and file formats.


Similar datasets

SCARRIE Lexicon (Nasjonalbiblioteket, public access)
Grapheme-to-Phoneme Models for Norwegian (Nasjonalbiblioteket, public access)
Translation Memories from Semantix AS (Nasjonalbiblioteket, public access)
NST Pronunciation Lexicon for Swedish (Nasjonalbiblioteket, public access)
Texts from Norwegian Wikipedia (Nasjonalbiblioteket, public access)