Nasjonalbiblioteket

Translation Memory from Doffin

Description

This corpus contains data from Doffin, the Norwegian web-based database for notices of public procurement and procurement in the utility sector, managed by The Norwegian Agency for Public and Financial Management.

The Language Bank received the data as an XML database dump consisting of 41,143 document pairs (original and translation). Of these, 40,631 were translations from Norwegian to English, and only these Norwegian-to-English pairs are included in the corpus. Of the Norwegian source documents, 39,893 were in Norwegian Bokmål and 736 in Norwegian Nynorsk.

Originals and translations were first aligned at document level using an internal document identifier. Sentences were then extracted with the NLTK Punkt Sentence Tokenizer and aligned with Hunalign. Exact duplicate translations were discarded.
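The duplicate-removal step can be illustrated with a minimal sketch. It assumes that "exact duplicate" means both the source and the target sentence are character-for-character identical, with no normalization; the description does not state the exact criterion, and the function name `drop_exact_duplicates` and the sample pairs are hypothetical.

```python
def drop_exact_duplicates(pairs):
    """Return the pairs in their original order, keeping only the first
    occurrence of each (source, target) combination.

    Assumption: a duplicate is a pair whose source AND target strings are
    both identical to an earlier pair. No case folding or whitespace
    normalization is applied.
    """
    seen = set()
    unique = []
    for source, target in pairs:
        key = (source, target)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical sample data, not taken from the corpus.
aligned = [
    ("Tildeling av kontrakt", "Contract award"),
    ("Tildeling av kontrakt", "Contract award"),
    ("Tildeling av kontrakt", "Award of contract"),
]
unique_pairs = drop_exact_duplicates(aligned)
```

Keying on the whole pair keeps differing translations of the same source sentence, which is the conservative reading of "exact duplicates".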

We recorded a total of 293,649 translation units (TUs) for Norwegian Bokmål to English and 6,342 TUs for Norwegian Nynorsk to English. A TU is a translation pair consisting of an original text and its aligned translation. It usually corresponds to a more or less meaningful linguistic unit, typically a sentence or a heading, but a TU may also consist of a single word or of several clauses. The translation units for the two language pairs are distributed as two separate files, both in TMX 1.4 format (an XML-based format).
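As a rough illustration of how the TMX files might be read, the sketch below streams `<tu>` elements with Python's standard `xml.etree.ElementTree` and collects the `<seg>` text of each `<tuv>` by language. The element names follow the TMX 1.4 specification; the language codes `nb` and `en`, the header values, and the sample sentence pair are assumptions, and the codes used in the actual files may differ.

```python
import io
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(source):
    """Yield one {language code: segment text} dict per <tu> element.

    `source` is a file name or a binary file object. The file is parsed
    incrementally, so large TMX files need not fit in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            unit = {}
            for tuv in elem.iter("tuv"):
                seg = tuv.find("seg")
                if seg is not None:
                    lang = tuv.get(XML_LANG, "")
                    # itertext() also picks up text inside inline markup.
                    unit[lang] = "".join(seg.itertext()).strip()
            yield unit
            elem.clear()  # release the children of the processed <tu>


# Hypothetical one-unit TMX document, not taken from the corpus.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="nb" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="nb"><seg>Tildeling av kontrakt</seg></tuv>
      <tuv xml:lang="en"><seg>Contract award</seg></tuv>
    </tu>
  </body>
</tmx>"""

units = list(iter_translation_units(io.BytesIO(SAMPLE)))
```

To read a downloaded file instead, pass its path to `iter_translation_units` in place of the `io.BytesIO` object.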

Distributions
Access URL:
https://hdl.handle.net/21.11146/63

Similar datasets

Norsk Ordbank - Norwegian Nynorsk 2005-2012 (Nasjonalbiblioteket, public access)
ONOMASTICA Pronunciation Lexicon 2 (Nasjonalbiblioteket, public access)
Translation Memories from Semantix AS (Nasjonalbiblioteket, public access)
NST Pronunciation Lexicon for Swedish (Nasjonalbiblioteket, public access)
spaCy for Norwegian Nynorsk (Nasjonalbiblioteket, public access)