Fineweb-2

This is the second iteration of the popular FineWeb dataset, bringing high-quality pretraining data to over 1000 languages.

The FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.

The data was sourced from 96 CommonCrawl snapshots, spanning the summer of 2013 to April 2024, and processed using datatrove, our large-scale data processing library. This carefully deduplicated and filtered dataset comprises roughly 8 terabytes of compressed text data, with almost 3 trillion words (see "How many tokens?" for more details). For PII handling and opt-out, see "Personal and Sensitive Information and opt-out".

Data and resources

Keywords

Additional info

URI https://data.gov.dk/dataset/lang/b2750f44-b4e6-45a1-b6fd-9d8ac73d0c3a
Landing page https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
Harvested by Datavejviser No
Publication date 08-01-2025
Last modified date 08-01-2025
Update frequency updated continuously
Coverage period  / 
Topic(s) Government and the public sector
Access rights public
Conforms to
Provenance statement

This dataset originates from Common Crawl. The use of this dataset is also subject to CommonCrawl's Terms of Use.

Documentation