Nasjonalbiblioteket

Texts from Norwegian Wikipedia

Description

This corpus is a dump, taken around March 20, 2019, of all Wikipedia articles written in Norwegian Bokmål, Norwegian Nynorsk and Northern Sami. It contains 492,864 articles in Norwegian Bokmål, 139,927 in Norwegian Nynorsk and 7,626 in Northern Sami. The files are structured as JSON arrays of all the articles as they appear on the web. Each article is a structured element with a single level of "key:value" pairs holding text and metadata. There are eight such key:value pairs per article:

  • bytelength: length of text in number of bytes
  • pageid: text identifier
  • title: title as in Wikipedia
  • hiddencategories: metadata
  • text: text as in Wikipedia
  • revised: audit information
  • contentcategories: metadata
  • wikidata: other data

An example of the JSON format can be found in the documentation file.
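
The structure described above (a JSON array whose elements each carry the eight listed keys) can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names are hypothetical, the file path is whatever the download unpacks to, and the value types of the fields (for example whether bytelength is stored as a number) are not stated in the description and are not checked here.

```python
import json
from pathlib import Path

# The eight keys listed in the description above.
EXPECTED_KEYS = {
    "bytelength", "pageid", "title", "hiddencategories",
    "text", "revised", "contentcategories", "wikidata",
}


def load_articles(path):
    """Read one corpus file, assumed to hold a single JSON array of articles."""
    with Path(path).open(encoding="utf-8") as handle:
        articles = json.load(handle)
    if not isinstance(articles, list):
        raise ValueError("expected a JSON array at the top level")
    return articles


def missing_keys(article):
    """Return the documented keys that are absent from one article, sorted."""
    return sorted(EXPECTED_KEYS - set(article))


def titles_and_lengths(articles):
    """Pair each article's title with its bytelength value (None if absent)."""
    return [(a.get("title"), a.get("bytelength")) for a in articles]
```

Note that json.load reads the whole array into memory at once, which may be heavy for the larger Bokmål file; a streaming parser would be an alternative, at the cost of an extra dependency.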

Distributions
1

Download
Description: Not provided
Access URL: https://hdl.handle.net/21.11146/50
Direct download:
API: Not provided
Documentation: Not provided
License:
Conforms to: Not provided

APIs providing this dataset
0

No registered APIs provide this dataset.

Similar datasets

  • Norsk Ordbank - Norwegian Nynorsk 2005-2012 (Nasjonalbiblioteket): public access
  • Translation Memories from Semantix AS (Nasjonalbiblioteket): public access
  • NST Pronunciation Lexicon for Swedish (Nasjonalbiblioteket): public access
  • Grapheme-to-Phoneme Models for Norwegian (Nasjonalbiblioteket): public access
  • spaCy for Norwegian Nynorsk (Nasjonalbiblioteket): public access