N-grams from NBdigital 2021

Description

This resource contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at the National Library of Norway up to July 2021. The n-grams were extracted from approximately 580,000 books and 3,400,000 newspapers, amounting to a total of 122 billion tokens (words and punctuation). The n-grams are offered as gzip-compressed CSV files (UTF-8-encoded).

Columns in the n-gram CSV files:

first - the first word (in uni-, bi- and trigrams)
second - the second word (in bi- and trigrams)
third - the third word (in trigrams)
lang - the language of the n-gram (books only; newspapers currently have no language classification)
freq - the total frequency of the n-gram in the collection of books or newspapers
json - a dictionary with raw frequency for each year
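As a rough illustration of the column layout above, the following Python sketch reads n-gram rows and decodes the per-year dictionary. It rests on assumptions that should be checked against the README files: that each CSV file has a header row using the column names listed here, and that the json column holds a JSON object whose keys are years. The function names and the two sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import gzip
import io
import json


def open_ngram_file(path):
    """Open a gzip-compressed, UTF-8-encoded CSV file as a text stream."""
    return gzip.open(path, "rt", encoding="utf-8", newline="")


def read_ngrams(stream):
    """Yield (words, lang, freq, per_year) tuples from an open text stream.

    Assumes a header row with the column names first, second, third,
    lang, freq and json (columns that do not apply may be absent).
    """
    for row in csv.DictReader(stream):
        # Keep whichever of first/second/third are present and non-empty.
        words = tuple(row[c] for c in ("first", "second", "third") if row.get(c))
        # Assumed format: a JSON object mapping year to raw frequency.
        per_year = {int(y): int(n) for y, n in json.loads(row["json"]).items()}
        yield words, row.get("lang"), int(row["freq"]), per_year


# Synthetic bigram sample in the assumed layout (not real data).
sample = (
    'first,second,lang,freq,json\n'
    'god,dag,nob,5,"{""1900"": 2, ""1950"": 3}"\n'
    'i,dag,nob,7,"{""1950"": 7}"\n'
)
rows = list(read_ngrams(io.StringIO(sample)))
```

For a downloaded file, the same reader could be used as `with open_ngram_file("ngram-2021-digibok-bigram.csv.gz") as f: ...`, iterating over `read_ngrams(f)` row by row rather than loading the whole file into memory.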

totals.json contains aggregated frequencies per year in the book and newspaper corpora. Using these numbers, relative frequencies can be calculated in order to compare frequencies over time as in NB N-gram.
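The calculation can be sketched in Python as follows. This is only an illustration: the exact structure of totals.json is not described here, so the sketch assumes the totals for one corpus have already been loaded into a mapping from year to total token count. The function name, the per-million scaling and all numbers are hypothetical.

```python
def relative_frequencies(per_year, totals, per=1_000_000):
    """Return the n-gram's frequency per `per` tokens for each year.

    per_year: raw frequency of one n-gram by year (from the json column).
    totals:   total token count by year for the same corpus (assumed shape).
    Years with a missing or zero total are skipped.
    """
    return {
        year: count * per / totals[year]
        for year, count in sorted(per_year.items())
        if totals.get(year)
    }


# Invented numbers, for illustration only.
per_year = {1900: 2, 1950: 3, 1999: 4}
totals = {1900: 1_000_000, 1950: 3_000_000}
rel = relative_frequencies(per_year, totals)
```

Dividing by the yearly total makes years with very different amounts of digitized text comparable, which raw counts alone do not allow.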

metadata-digibok.csv and metadata-digavis.csv contain simple metadata for the books and newspapers. If you need more extensive metadata, you could use Oria or the APIs at https://api.nb.no/.

See the documentation files for further information.

Distributions (1)

Download

Description:

Not provided

Access URL:

https://hdl.handle.net/21.11146/70

Direct download:

https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-totals.json
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-README-eng.md
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-digibok-unigram.csv.gz
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-digavis-unigram.csv.gz
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-digibok-bigram.csv.gz
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-README-nob.md
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-metadata-digavis.csv.gz
https://www.nb.no/sbfil/ngram/ngram_2021/2021_NBngram.pdf
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-metadata-digibok.csv.gz
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-digavis-trigram.csv.gz
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-digibok-trigram.csv.gz
https://www.nb.no/sbfil/ngram/ngram_2021/ngram-2021-digavis-bigram.csv.gz

API:

Not provided

Documentation:

Not provided

License:

https://creativecommons.org/publicdomain/zero/1.0/

Conforms to:

Not provided

APIs providing this dataset (0)

No registered APIs provide this dataset.

Similar datasets

Norsk Ordbank - Norwegian Nynorsk 2005-2012	Nasjonalbiblioteket	Public access
SCARRIE Lexicon	Nasjonalbiblioteket	Public access
ONOMASTICA Pronunciation Lexicon 2	Nasjonalbiblioteket	Public access
Translation Memories from Semantix AS	Nasjonalbiblioteket	Public access
Målfrid 2023 – Freely Available Documents from Norwegian State Institutions	Nasjonalbiblioteket	Public access
