Nasjonalbiblioteket

N-grams from NBdigital

Description

This resource contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at the National Library of Norway up to September 2013. The n-grams were extracted from a collection of approximately 220,000 books and 540,000 newspapers.

The n-grams are available in two formats, CSV and SQLite. CSV is probably the more useful format for most developers, because the files are easy to import into standard applications. The SQLite files contain indexed databases, which are used by the NB N-gram service. Users who want to contribute to the development of NB N-gram can download the source code from GitHub and the SQLite databases from this page.
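This page does not document the column layout of the CSV files or the table schema of the SQLite databases, so a first step after downloading is to inspect them. The following is a minimal Python sketch of that step using only the standard library. The helper names `peek_csv` and `describe_schema`, the tab delimiter, and the stand-in data are assumptions made for illustration; they are not taken from the NB N-gram source code.

```python
import csv
import io
import sqlite3
from itertools import islice


def peek_csv(stream, delimiter=",", limit=5):
    """Return the first `limit` rows of a delimited text stream as lists of strings."""
    reader = csv.reader(stream, delimiter=delimiter)
    return list(islice(reader, limit))


def describe_schema(conn):
    """Map each table name in an SQLite database to its list of column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # Quote the identifier so unusual table names cannot break the statement.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info yields one row per column; index 1 holds the column name.
        schema[table] = [col[1] for col in conn.execute(f"PRAGMA table_info({quoted})")]
    return schema


# Stand-in data only: the real files may use other delimiters, columns and tables.
sample_csv = io.StringIO("og\t100\ni\t90\n")
print(peek_csv(sample_csv, delimiter="\t", limit=1))

demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE unigram (first TEXT, freq INTEGER)")
print(describe_schema(demo_db))
```

For a downloaded file, the same helpers could be called with `open(path, newline="", encoding="utf-8")` and `sqlite3.connect(path)`. The UTF-8 encoding is likewise an assumption and should be checked against the actual files.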

A word count by source (books/newspapers) and language variety (Bokmål/Nynorsk) is given in the JSON file.

Distributions: 1

APIs providing this dataset

No registered APIs provide this dataset.

Similar datasets

Norsk Ordbank - Norwegian Nynorsk 2005-2012 (Nasjonalbiblioteket, public access)
Translation Memories from Semantix AS (Nasjonalbiblioteket, public access)
NST Pronunciation Lexicon for Swedish (Nasjonalbiblioteket, public access)
spaCy for Norwegian Nynorsk (Nasjonalbiblioteket, public access)
Discussions from Wikipedia (Nasjonalbiblioteket, public access)