Nasjonalbiblioteket

Texts from Norwegian Wikipedia

Description

This corpus is a dump, taken on approximately March 20, 2019, of all Wikipedia articles written in Norwegian Bokmål, Norwegian Nynorsk and Northern Sami. The corpus contains 492,864 articles for Norwegian Bokmål, 139,927 articles for Norwegian Nynorsk and 7,626 articles for Northern Sami. Each file is structured as a JSON array of all the articles as they appear on the web. Each article is a structured element with a single level of "key:value" pairs containing text and metadata. There are eight such key:value pairs per article:

  • bytelength: length of text in number of bytes
  • pageid: text identifier
  • title: title as in Wikipedia
  • hiddencategories: metadata
  • text: text as in Wikipedia
  • revised: audit information
  • contentcategories: metadata
  • wikidata: other data

An example of the JSON format can be found in the documentation file.
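
The structure described above can be sketched in Python as follows. This is a minimal illustration, not the publisher's tooling: the key names come from the list above, but the sample article, its values and the value types (integers, strings, lists) are invented placeholders, and the helper names `load_articles` and `missing_keys` are hypothetical.

```python
import io
import json

# The eight keys listed in the corpus description.
EXPECTED_KEYS = {
    "bytelength", "pageid", "title", "hiddencategories",
    "text", "revised", "contentcategories", "wikidata",
}

# Invented placeholder article. The values and their types are assumptions
# made for illustration only; they are not taken from the corpus.
SAMPLE_JSON = json.dumps([{
    "bytelength": 11,
    "pageid": 1,
    "title": "Placeholder",
    "hiddencategories": [],
    "text": "Placeholder",
    "revised": "",
    "contentcategories": [],
    "wikidata": "",
}])


def load_articles(fp):
    """Parse a file-like object that holds one JSON array of articles."""
    articles = json.load(fp)
    if not isinstance(articles, list):
        raise ValueError("expected a top-level JSON array of articles")
    return articles


def missing_keys(article):
    """Return the documented keys that an article object lacks."""
    return EXPECTED_KEYS - set(article)


if __name__ == "__main__":
    # For an extracted corpus file, pass an open file handle instead of the
    # in-memory sample (the actual file names are not given in the description).
    for article in load_articles(io.StringIO(SAMPLE_JSON)):
        print(article["title"], sorted(missing_keys(article)))
```

Note that `json.load` reads the whole array into memory, which may be impractical for the larger language files; an incremental JSON parser may be a better fit there.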

Distributions (1)

Nameless distribution
  • gtar
Description: Not provided
Access URL: https://hdl.handle.net/21.11146/50
Status: Not provided
Direct download:
API: Not provided
Documentation: Not provided
License:
Conforms to: Not provided
Rights for use: Not provided

APIs providing this dataset (0)

No registered APIs provide this dataset.

Similar datasets

  • ONOMASTICA Pronunciation Lexicon (Nasjonalbiblioteket, public access)
  • SNOMED CT – English Terms Translated to Norwegian Bokmål and Norwegian Nynorsk (Nasjonalbiblioteket, public access)
  • N-grams from NBdigital (Nasjonalbiblioteket, public access)
  • Translation memories from Amesto Translations AS (Nasjonalbiblioteket, public access)
  • Grapheme-to-Phoneme Models for Norwegian Bokmål (Nasjonalbiblioteket, public access)