| Gutenberg | 14 GB | https://www.gutenberg.org/policy/license.html | Llama | The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only | en, de, fr, es, nl, pl, pt | [huggingface](https://huggingface.co/datasets/manu/project_gutenberg) | |
| Books3 | ~100 GB | [copyright infringement](https://news.ycombinator.com/item?id=37685313) | Llama | The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027 | en | | 800 GB is the size of the full Pile; Books3 alone is roughly 100 GB of plain text. |
| CommonCrawl | 100 PB+ | Includes copyrighted work and is distributed from the US under fair-use claims; researchers in other jurisdictions have worked around local copyright law with techniques such as shuffling sentences or referencing, rather than redistributing, the crawl. | Llama | https://commoncrawl.org/ | all | | Needs preprocessing. LLaMA, for example, preprocesses five CommonCrawl dumps (2017-2020) with the CCNet pipeline (Wenzek et al., 2020): deduplicate at the line level, identify the language with a fastText linear classifier to remove non-English pages, and filter low-quality content with an n-gram language model; a further linear model, trained to distinguish pages used as references in Wikipedia vs. randomly sampled pages, discards pages not classified as references. A toy version of these filters is sketched below the table. |
| C4 | 300 GB | ODC-BY | Llama | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | en | [huggingface](https://huggingface.co/datasets/c4) | Cleaned version of the April 2019 CommonCrawl snapshot, produced for the T5 paper. |
| TinyStories | 2 GB | CDLA-Sharing-1.0 | | TinyStories: How Small Can Language Models Be and Still Speak Coherent English? | en | [huggingface](https://huggingface.co/datasets/roneneldan/TinyStories) | Small enough to download in full; see the loading sketch below the table. |
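
Several of the corpora above are mirrored on the Hugging Face Hub. Below is a minimal sketch of pulling them with the `datasets` library, using the dataset ids from the links in the table; split and config names follow common Hub conventions and are assumptions, so check each dataset card. Streaming is used for the larger corpora so that hundreds of gigabytes are not downloaded up front.

```python
# Minimal sketch: loading the Hub-hosted datasets from the table with the
# `datasets` library (pip install datasets). Dataset ids come from the links
# above; split/config names are assumptions taken from common Hub conventions.
from datasets import load_dataset

# TinyStories (~2 GB) is small enough to download in full.
tiny = load_dataset("roneneldan/TinyStories", split="train")
print(tiny[0]["text"][:80])

# C4 (~300 GB) is better consumed as a stream than downloaded up front.
c4 = load_dataset("c4", "en", split="train", streaming=True)
for example in c4.take(2):
    print(example["text"][:80])

# Gutenberg mirror; one split per language (en, de, fr, ...) per its card.
gutenberg = load_dataset("manu/project_gutenberg", split="en", streaming=True)
```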
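The CommonCrawl row describes a CCNet-style cleaning pass: line-level deduplication, fastText language identification, and n-gram LM quality filtering. The sketch below is a toy, single-process version of those three filters, not the CCNet implementation itself; the model file names (`lid.176.bin`, `en.arpa.bin`) and all thresholds are illustrative assumptions, and the Wikipedia-reference classifier from the LLaMA recipe is omitted.

```python
# Toy version of the CCNet-style filters from the CommonCrawl row. Model paths
# and thresholds are illustrative assumptions, not values from the paper.
import hashlib

import fasttext  # pip install fasttext; lid.176.bin is fastText's language-ID model
import kenlm     # pip install kenlm; en.arpa.bin stands in for any English n-gram LM

lid = fasttext.load_model("lid.176.bin")
lm = kenlm.Model("en.arpa.bin")

seen_hashes: set[bytes] = set()

def keep_line(line: str,
              lang: str = "en",
              min_lang_prob: float = 0.8,           # assumed threshold
              min_logprob_per_word: float = -10.0,  # assumed threshold (log10)
              ) -> bool:
    """Line-level dedup, language ID, and n-gram LM quality filter."""
    line = line.strip()
    if not line:
        return False
    # 1) Deduplicate at the line level via a hash of the normalized line.
    h = hashlib.sha1(line.lower().encode("utf-8")).digest()
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    # 2) Language identification with the fastText linear classifier.
    labels, probs = lid.predict(line)
    if labels[0] != f"__label__{lang}" or probs[0] < min_lang_prob:
        return False
    # 3) Quality filter: drop lines the n-gram LM scores as too unlikely,
    #    normalizing the log-probability by line length.
    n_words = max(len(line.split()), 1)
    return lm.score(line) / n_words >= min_logprob_per_word

# e.g. filter one extracted page (hypothetical file name):
kept = [l for l in open("page.txt", encoding="utf-8") if keep_line(l)]
```

At CommonCrawl scale the real pipeline shards this work and replaces the in-memory hash set with on-disk hash partitions, but the per-line logic is the same shape.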