Dataset

Fineweb-2

This is the second iteration of the popular FineWeb dataset, bringing high-quality pretraining data to over 1000 languages.

The FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.

The data was sourced from 96 CommonCrawl snapshots, spanning the summer of 2013 to April 2024, and processed using datatrove, our large-scale data processing library. This carefully deduplicated and filtered dataset comprises roughly 8 terabytes of compressed text data, with almost 3 trillion words (see the "How many tokens?" section of the dataset documentation for more details). For PII handling and opt-out, see the "Personal and Sensitive Information and opt-out" section of the same documentation.

Data and resources

Fineweb-2 Danish - Test and training data (file type: http://publications.europa.eu/resource/authority/file-type/PARQUET)
Fineweb-2 Danish - Filtered-out (removed) data (file type: http://publications.europa.eu/resource/authority/file-type/PARQUET)
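
As a rough illustration of how the Danish resources listed here might be read from the destination repository (HuggingFaceFW/fineweb-2), the sketch below streams them with the Hugging Face `datasets` library. The subset names `dan_Latn` and `dan_Latn_removed`, the `train` split name, and the `text` column are assumptions based on the naming pattern used in the dataset's documentation and are not stated in this catalog entry; the helper names `fineweb2_config`, `stream_danish`, and `preview` are hypothetical.

```python
from itertools import islice


def fineweb2_config(lang_code: str, script: str = "Latn", removed: bool = False) -> str:
    """Build a subset name such as 'dan_Latn' or 'dan_Latn_removed' (assumed pattern)."""
    name = f"{lang_code}_{script}"
    return f"{name}_removed" if removed else name


def stream_danish(removed: bool = False, split: str = "train"):
    """Return a streaming iterator over the Danish subset (requires network access)."""
    # Imported lazily so fineweb2_config works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb-2",
        name=fineweb2_config("dan", removed=removed),  # assumed subset name
        split=split,                                   # assumed split name
        streaming=True,  # avoid downloading the full Parquet files up front
    )


def preview(n: int = 3, removed: bool = False) -> None:
    """Print the first 200 characters of the first n documents."""
    for row in islice(stream_danish(removed=removed), n):
        print(row["text"][:200])  # "text" column is an assumption
```

Calling `preview()` would print a few kept documents, and `preview(removed=True)` would do the same for the filtered-out data, provided the assumed subset and column names match the published dataset.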

Keywords

Additional info

URI	https://data.gov.dk/dataset/lang/b2750f44-b4e6-45a1-b6fd-9d8ac73d0c3a
Landing page	https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
Harvested by Datavejviser	No
Publication date	08-01-2025
Last modified date	08-01-2025
Update frequency	updated continuously
Coverage period	/
Topic(s)	Government and public sector
Access rights	public
Conforms to
Provenance statement	This dataset originates from Common Crawl. The use of this dataset is also subject to CommonCrawl's Terms of Use.
Documentation	https://huggingface.co/datasets/HuggingFaceFW/fineweb-2/blob/main/README.md