Resources - sprogteknologi.dk

Extract and annotation of DanPASS

The data contains the original TextGrid information from the DanPASS corpus plus additional annotation of the corpus, reorganized into semicolon-separated columns in a txt file.

TXT

DaAnonymization

DaAnonymization is an anonymization pipeline that provides easy access to anonymization of Danish text using DaCy's entity recognition and regular expressions. The tool...

Python

DaLUKE

DaLUKE was developed as part of a bachelor's project in Artificial Intelligence and Data at the Technical University of Denmark. DaLUKE is a Danish version of LUKE, which is a...

Python

Leipzig Corpora Collection

The Leipzig Corpora Collection provides different tools and data for download, which are protected by copyright. For more details please refer to our terms of usage....

TXT

Ælæctra

Ælæctra is a transformer-based NLP language model created by applying the ELECTRA-Small pretraining method to The Danish Gigaword Project's dataset (reference is made to...

BIN

Udtaleordbog.dk

Udtaleordbog.dk is an online dictionary of Danish words and their inflected forms transcribed in IPA phonetic notation. The dictionary renders modern pronunciation, conservative pronunciation, less...

HTML
TXT

NB-BERT

"NB-BERT-base is a general BERT-base model built on the large digital collection at the National Library of Norway. This model is based on the same structure as BERT Cased...

BIN

Danish ELECTRA

ELECTRA model pretrained on Danish, on 17.5 GB of data. You can read more about the ELECTRA training method in this research paper: ELECTRA: Pre-training Text Encoders as Discriminators...

BIN

KlimaBERT

KlimaBERT is a tool that can identify and analyze political quotes related to climate. The model works best on official texts from...

BIN

RøBÆRTa

RøBÆRTa is a Danish pretrained RoBERTa language model. RøBÆRTa was trained on the Danish mC4 dataset as part of the Flax community week. The model is trained to guess a...

BIN

Danish ConvBERT

Two different sizes of ConvBERT models pretrained on Danish text data (approximately 17.5 GB of data). The ELECTRA pretraining method was used for pretraining. ConvBERT is a...

BIN

Bilingual English-Danish parallel corpus from The Viking Ship Museum website

Contents of https://www.vikingeskibsmuseet.dk were crawled, aligned on document and sentence level and converted into a parallel corpus. Contains 12403 translation units (EN-...

TMX

Danish Legal monolingual corpus from the contents of the retsinformation.dk web site

Danish Legal monolingual corpus from the contents of the retsinformation.dk web site. This dataset has been created within the framework of the European Language Resource...

TXT

COVID-19 EC-EUROPA v1 dataset. Bilingual (EN-DA)

Bilingual (EN-DA) corpus acquired from the website (https://ec.europa.eu/*coronavirus-response) of the EU portal (20th May 2020). Contains 2803 translation units (DA-EN).

TMX

Covid-19 EUR-LEX dataset. Bilingual (EN-DA)

Bilingual (EN-DA) corpus acquired from the website (https://eur-lex.europa.eu/legal-content) of the EU portal (9th July 2020). Contains 21238 translation units (DA-EN).

TMX

COVID-19 EUROPARL dataset v2. Bilingual (EN-DA)

Bilingual (EN-DA) corpus acquired from the website (https://www.europarl.europa.eu/) of the European Parliament (9th May 2020). Contains 633 translation units (DA-EN).

TMX

COVID-19 ANTIBIOTIC dataset. Bilingual (EN-DA)

This dataset has been generated out of public content available through the portal (https://antibiotic.ecdc.europa.eu/) of the European Centre for Disease Prevention and Control...

TMX

COVID-19 EU presscorner v2 dataset. Bilingual (EN-DA)

Bilingual (EN-DA) corpus acquired from the website (https://ec.europa.eu/commission/presscorner/) of the EU portal (8th July 2020). Contains 6261 translation units (DA-EN).

TMX

Bilingual corpus made out of PDF documents from the European Medicines...

EN-DA bilingual corpus made out of PDF documents from the European Medicines Agency (EMEA), https://www.ema.europa.eu (February 2020). Attribution details: This dataset has...

TMX

Compilation of Danish-English parallel corpora resources used for training...

This bilingual corpus was built from a number of different corpora drawn from selected public and private sources and has been used to train NTEU (Neural Translation for the...

TMX

191 resources found