Nasjonalbiblioteket

OCR Models for Sámi Languages

Description

This is a collection of models for OCR (optical character recognition) of Sámi languages. The models recognize text in images of printed material (scanned books, magazines, etc.) in North Sámi, South Sámi, Lule Sámi, and Inari Sámi.

The training and evaluation of the models are described in detail in the article 'Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway' (https://arxiv.org/abs/2501.07300).

The collection consists of three different types of models: Transkribus models, Tesseract models, and TrOCR models.

See the documentation file for more information.
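
As a rough illustration of how one of the Tesseract models from the collection might be applied to a scanned page, the sketch below builds and runs a Tesseract command line. It assumes the Tesseract CLI is installed and that a downloaded model file has been placed in a local directory. The model name `sami_model`, the directory `models`, and the file names are hypothetical placeholders, not names taken from the collection; the documentation file gives the actual ones.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, model_name, tessdata_dir):
    """Return the argument list for a Tesseract CLI call with a custom model.

    model_name is the file name of the .traineddata model without its
    extension; tessdata_dir is the directory that contains that file.
    Both values are placeholders chosen for this example.
    """
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "--tessdata-dir", str(tessdata_dir),
        "-l", model_name,
    ]


def run_ocr(image_path, output_base, model_name, tessdata_dir):
    """Run Tesseract and return the recognized text.

    Tesseract writes its plain-text result to '<output_base>.txt'.
    """
    cmd = build_tesseract_command(image_path, output_base, model_name, tessdata_dir)
    subprocess.run(cmd, check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `run_ocr("page.png", "page", "sami_model", "models")` would then return the text recognized on `page.png`, provided a file `models/sami_model.traineddata` exists. The Transkribus and TrOCR models are used through other tools and are not covered by this sketch.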

Distributions (1)

Download
Description: Not provided
Access URL: https://hdl.handle.net/21.11146/100
Direct download:
API: Not provided
Documentation: Not provided
License:
Conforms to: Not provided

APIs providing this dataset (0)

No registered APIs provide this dataset.