Commit 5c6c353f authored by Vlad-Andrei BĂDOIU (78692)

Add common datasets table

parent 12f0b56d
1 merge request: !6 Add common datasets table

## Datasets in LLMs
### Natural Language
| Name | Size | License | Used-By | Source Paper | Language | Link | Notes |
|-------------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|---------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|-------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Wikipedia | 80 GB | cc-by-sa-3.0 | Llama | | bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. | [huggingface](https://huggingface.co/datasets/wikipedia) | |
| falcon-refinedweb | 1.2 TB | ODC-By 1.0 | Falcon | The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only | | [huggingface](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | |
| SlimPajama | 1 TB+ | Apache License | phi2 | | | [huggingface](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | |
| Gutenberg | 14 GB | [Project Gutenberg License](https://www.gutenberg.org/policy/license.html) | Llama | | en, de, fr, es, nl, pl, pt | [huggingface](https://huggingface.co/datasets/manu/project_gutenberg) | |
| Books3 | 800 GB | [copyright infringement](https://news.ycombinator.com/item?id=37685313) | Llama | The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027 | | | |
| CommonCrawl | 100 PB+ | The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have used techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other legal jurisdictions. | Llama | | all | [commoncrawl.org](https://commoncrawl.org/) | Needs preprocessing. Llama preprocesses five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This pipeline deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages, and filters low-quality content with an n-gram language model. In addition, a linear model is trained to classify pages used as references in Wikipedia vs. randomly sampled pages, and pages not classified as references are discarded. |
| C4 | 300 GB | ODC-BY | Llama | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | en | [huggingface](https://huggingface.co/datasets/c4) | Colossal Clean Crawled Corpus: a cleaned-up version of CommonCrawl |
| ArXiv | 92 GB+ | | Llama | | | [github](https://github.com/paperswithcode/axcell) | |
| TinyStories | 2 GB | CDLA-Sharing-1.0 | | TinyStories: How Small Can Language Models Be and Still Speak Coherent English? | en | [huggingface](https://huggingface.co/datasets/roneneldan/TinyStories) | |
| IMDB Reviews | 100 MB | | | | | | |
| TedTalks | 2 GB | cc-by-nc-nd-4.0 | | | | | |
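
The CommonCrawl row mentions deduplication at the line level. A minimal sketch of that idea follows, assuming a small in-memory corpus; it is not the CCNet implementation, and the names `normalize_line` and `dedup_lines` as well as the normalization rule (lowercasing and collapsing whitespace) are hypothetical choices made for illustration.

```python
import hashlib


def normalize_line(line: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace so that
    # trivially different copies of the same line compare equal.
    return " ".join(line.lower().split())


def dedup_lines(documents):
    """Keep only the first occurrence of each normalized line across all documents."""
    seen = set()  # truncated hashes of every normalized line kept so far
    result = []
    for doc in documents:
        kept = []
        for line in doc.splitlines():
            key = normalize_line(line)
            if not key:
                continue  # drop blank lines
            # Store a short hash instead of the full line to bound memory use.
            digest = hashlib.sha1(key.encode("utf-8")).digest()[:8]
            if digest in seen:
                continue  # this line already appeared earlier in the corpus
            seen.add(digest)
            kept.append(line)
        result.append("\n".join(kept))
    return result


docs = [
    "Home | About | Contact\nThe cat sat on the mat.",
    "home | about | contact\nA different sentence.",
]
cleaned = dedup_lines(docs)
```

Here the navigation line repeated in the second document is dropped, while the unique sentences survive. At CommonCrawl scale the set of hashes would not fit in one process's memory, so a real pipeline would need to shard or otherwise distribute this state.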