Nasjonalbiblioteket

Målfrid 2024 – Freely Available Documents from Norwegian State Institutions

Description

This corpus consists of documents from 497 domains of Norwegian state institutions and comprises approximately 2.6 billion tokens in total. In addition to Norwegian Bokmål and Nynorsk texts, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.

The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data to map the use of Norwegian Bokmål and Norwegian Nynorsk on the domains of Norwegian state institutions.

The corpus is the result of a focused crawl conducted between December 2023 and January 2024, recursively downloading text documents (HTML, DOC(X)/ODT and PDF) from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions.
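The crawl strategy described above can be illustrated with a minimal sketch. This is not the Målfrid project's actual crawler. It assumes that "level 12" refers to link depth from the seed page, and the names `crawl`, `fetch_links`, `robots_lines` and `delay` are hypothetical, introduced only for illustration. To stay self-contained, the page download is replaced by a callback that returns the links found on a page.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def crawl(seed, fetch_links, robots_lines, max_depth=12, delay=0.0, agent="*"):
    """Breadth-first crawl of a single domain; returns the URLs visited.

    fetch_links(url) is a hypothetical callback returning the hrefs found
    on that page; a real crawler would download and parse the page here.
    robots_lines is the content of the domain's robots.txt as a list of lines.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)

    domain = urlparse(seed).netloc
    seen = {seed}
    visited = []
    queue = deque([(seed, 0)])  # (url, link depth from the seed)

    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # obey robots.txt
        time.sleep(delay)  # politeness: pause between requests
        visited.append(url)
        if depth == max_depth:
            continue  # assumed meaning of "down to and including level N"
        for href in fetch_links(url):
            link = urljoin(url, href)
            # Stay on the seed's domain and skip URLs already queued.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

A real crawl would use a non-zero `delay`, fetch each domain's robots.txt before starting, and keep only responses in the target formats (HTML, DOC(X)/ODT and PDF).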

For technical information, please consult the documentation files.

Distributions
1

Download
Description:
Not provided
Access URL:
https://hdl.handle.net/21.11146/99
Direct download:
API:
Not provided
Documentation:
Not provided
License:
Conforms to:
Not provided

APIs providing this dataset
0

No registered APIs provide this dataset.

Similar datasets

Norsk Ordbank - Norwegian Nynorsk 2005-2012 (Nasjonalbiblioteket)
Public access
Translation Memories from Semantix AS (Nasjonalbiblioteket)
Public access
NST Pronunciation Lexicon for Swedish (Nasjonalbiblioteket)
Public access
Grapheme-to-Phoneme Models for Norwegian (Nasjonalbiblioteket)
Public access
spaCy for Norwegian Nynorsk (Nasjonalbiblioteket)
Public access