Nasjonalbiblioteket

Målfrid 2021 - Freely Available Documents from Norwegian State Institutions

Description

This corpus consists of documents from 339 internet domains run by Norwegian state institutions, and comprises approximately 4.1 billion tokens (words and punctuation) in total, which makes it one of the largest freely available text resources for Norwegian Bokmål and Nynorsk. In addition to Norwegian, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.

The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data for mapping the usage of Norwegian Bokmål and Norwegian Nynorsk in Norwegian state institutions.

The corpus is the result of a focused crawl conducted between December 11th 2020 and January 18th 2021, recursively downloading text documents (HTML, DOC(X)/ODT and PDF) from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions.
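The crawl procedure described above can be illustrated with a minimal sketch. It is not the crawler used in the project: the function name `crawl_order`, the user agent `example-bot`, the in-memory `links` graph (a stand-in for fetching and parsing pages) and the reading of "level 12" as link depth from the start URL are all assumptions made for illustration. Politeness delays between requests are omitted.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

MAX_DEPTH = 12  # assumed reading of "down to and including level 12"


def crawl_order(start_url, links, robots_lines,
                user_agent="example-bot", max_depth=MAX_DEPTH):
    """Breadth-first traversal of a link graph with a depth limit.

    `links` maps a URL to the URLs it links to (hypothetical stand-in
    for downloading a page and extracting its links). URLs disallowed
    by the given robots.txt lines are skipped and not followed.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)

    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # obey robots.txt
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

A real crawler would replace the `links` lookup with an HTTP fetch restricted to the selected domains and would wait between requests to the same host.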

The crawled documents were further processed according to their format: text was extracted from HTML using the boilerplate removal system Justext (http://corpus.tools/wiki/Justext), from Word/ODT documents using Textract (https://textract.readthedocs.io/en/stable/) and from PDFs using Google Cloud Vision OCR.

The language of the extracted text was identified at document level using TextCat (cf. https://www.let.rug.nl/~vannoord/TextCat/), and the result is provided as part of the metadata. Exact duplicates were removed within each domain.
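Exact deduplication within a domain can be sketched as follows. The description does not say how duplicates were detected, so the choices here are assumptions: documents are compared by a SHA-256 hash of their joined paragraphs, and the domain is taken from the host part of the URL. The function name `deduplicate_per_domain` is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def deduplicate_per_domain(documents):
    """Keep the first occurrence of each exact text within a domain.

    Each document is a dict with at least `url` and `fulltext` keys.
    Identical texts on different domains are both kept.
    """
    seen = set()
    unique = []
    for doc in documents:
        domain = urlsplit(doc["url"]).netloc.lower()
        text = "\n".join(doc["fulltext"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (domain, digest)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Because only exact matches are removed, near-duplicates such as pages that differ in a single date or menu item remain in the corpus.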

The corpus is provided as gzipped JSON lines (jsonl), one document per line. There is one JSONL file per combination of domain, language and content type. The files are encoded as UTF-8, with ASCII escape sequences. Each document contains the following keys:

  • lang: language of the document (detected using TextCat)
  • url: the URL of the document at crawl time
  • date: crawl date
  • mimetype: media type of the document (simplified): HTML, DOC or PDF
  • fulltext: an array of strings, where each string represents one paragraph. An empty string denotes a new page in the PDF documents
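A minimal sketch of reading one of these files with the Python standard library is shown below. It assumes the archive has already been unpacked. The function names `read_documents` and `split_pages` are illustrative and not part of any official tooling. The JSON parser decodes the ASCII escape sequences, so no extra handling of non-ASCII characters is needed.

```python
import gzip
import json


def read_documents(path):
    """Yield one document (a dict) per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_pages(fulltext):
    """Group paragraphs into pages, treating an empty string as a page break.

    This is only meaningful for PDF documents, where an empty string
    in `fulltext` denotes a new page.
    """
    pages = [[]]
    for paragraph in fulltext:
        if paragraph == "":
            pages.append([])
        else:
            pages[-1].append(paragraph)
    return pages
```

For example, iterating over `read_documents(path)` for a given file and calling `split_pages(doc["fulltext"])` on each PDF document gives a list of pages, each of which is a list of paragraph strings.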

Distributions
1

Nameless distribution
  • gtar
Description:
Not provided
Access URL:
https://hdl.handle.net/21.11146/69
Status:
Not provided
Direct download:
API:
Not provided
Documentation:
Not provided
License:
Conforms to:
Not provided
Rights for use:
Not provided
Download

APIs providing this dataset
0

No registered APIs provide this dataset.

Similar datasets

  • SCARRIE Lexicon (Nasjonalbiblioteket): Public access
  • Grapheme-to-Phoneme Models for Norwegian (Nasjonalbiblioteket): Public access
  • Translation Memories from Semantix AS (Nasjonalbiblioteket): Public access
  • NST Pronunciation Lexicon for Swedish (Nasjonalbiblioteket): Public access
  • Texts from Norwegian Wikipedia (Nasjonalbiblioteket): Public access

Contact information

Contact point:
Not provided
Website:
https://www.nb.no/sprakbanken/
Email:
sprakbanken@nb.no
Telephone:
Not provided

About the data

Language:
Norwegian Bokmål, Norwegian Nynorsk, Northern Sami, Lule Sami, Southern Sami, English
Content providers:
Not provided
Provenance:
Not provided
Update frequency:
Not provided
First issued:

This date indicates when the data in this dataset was first released. It may have happened before the dataset was published on data.norge.no.

December 1, 2020
Last updated:
April 30, 2021
Accuracy:
Not provided
Availability:
Not provided
Completeness:
Not provided
Currentness:
Not provided
Relevance:
Not provided
Geographical scope:
Not provided
Temporal scope:
Not provided
Conforms to:

Reference to an implementation rule or other specification that forms the basis for the dataset.

Not provided

Legal basis

Not provided

Concepts used in the dataset

Not provided

References

Not provided

About this dataset

Publisher:
Nasjonalbiblioteket
Published:

This date indicates when the dataset was harvested by data.norge.no. It may have been available earlier elsewhere.

March 3, 2026
Last updated:
March 13, 2026
Landing page:
Not provided
Documentation:
Not provided
Dataset type:
Not provided
Metadata Quality:

Metadata quality is an indicator of how well the datasets are described using metadata.

Good (59%)
URI:

Themes

Keywords

Not provided
