| Gutenberg | 14 GB | https://www.gutenberg.org/policy/license.html | Llama | The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only | en, de, fr, es, nl, pl, pt | [huggingface](https://huggingface.co/datasets/manu/project_gutenberg) | |
| Books3 | ~100 GB | [copyright infringement](https://news.ycombinator.com/item?id=37685313) | Llama | The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027 | en | | 800 GB is the size of the full Pile; Books3 alone is roughly 100 GB of plain text. |
| CommonCrawl | 100 PB+ | Includes copyrighted work and is distributed from the US under fair-use claims; researchers in other jurisdictions have worked around local copyright law with techniques such as shuffling sentences or referencing, rather than redistributing, the crawl. | Llama | https://commoncrawl.org/ | all | | Needs preprocessing. LLaMA, for example, preprocesses five CommonCrawl dumps (2017-2020) with the CCNet pipeline (Wenzek et al., 2020): deduplicate at the line level, identify the language with a fastText linear classifier to remove non-English pages, and filter low-quality content with an n-gram language model; a further linear model, trained to distinguish pages used as references in Wikipedia vs. randomly sampled pages, discards pages not classified as references. A toy version of these filters is sketched below the table. |
| C4 | 300 GB | ODC-BY | Llama | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | en | [huggingface](https://huggingface.co/datasets/c4) | Cleaned version of the April 2019 CommonCrawl snapshot, produced for the T5 paper. |
| TinyStories | 2 GB | CDLA-Sharing-1.0 | | TinyStories: How Small Can Language Models Be and Still Speak Coherent English? | en | [huggingface](https://huggingface.co/datasets/roneneldan/TinyStories) | Small enough to download in full; see the loading sketch below the table. |
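
Several of the corpora above are mirrored on the Hugging Face Hub. Below is a minimal sketch of pulling them with the `datasets` library, using the dataset ids from the links in the table; split and config names follow common Hub conventions and are assumptions, so check each dataset card. Streaming is used for the larger corpora so that hundreds of gigabytes are not downloaded up front.

```python
# Minimal sketch: loading the Hub-hosted datasets from the table with the
# `datasets` library (pip install datasets). Dataset ids come from the links
# above; split/config names are assumptions taken from common Hub conventions.
from datasets import load_dataset

# TinyStories (~2 GB) is small enough to download in full.
tiny = load_dataset("roneneldan/TinyStories", split="train")
print(tiny[0]["text"][:80])

# C4 (~300 GB) is better consumed as a stream than downloaded up front.
c4 = load_dataset("c4", "en", split="train", streaming=True)
for example in c4.take(2):
    print(example["text"][:80])

# Gutenberg mirror; one split per language (en, de, fr, ...) per its card.
gutenberg = load_dataset("manu/project_gutenberg", split="en", streaming=True)
```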
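The CommonCrawl row describes a CCNet-style cleaning pass: line-level deduplication, fastText language identification, and n-gram LM quality filtering. The sketch below is a toy, single-process version of those three filters, not the CCNet implementation itself; the model file names (`lid.176.bin`, `en.arpa.bin`) and all thresholds are illustrative assumptions, and the Wikipedia-reference classifier from the LLaMA recipe is omitted.

```python
# Toy version of the CCNet-style filters from the CommonCrawl row. Model paths
# and thresholds are illustrative assumptions, not values from the paper.
import hashlib

import fasttext  # pip install fasttext; lid.176.bin is fastText's language-ID model
import kenlm     # pip install kenlm; en.arpa.bin stands in for any English n-gram LM

lid = fasttext.load_model("lid.176.bin")
lm = kenlm.Model("en.arpa.bin")

seen_hashes: set[bytes] = set()

def keep_line(line: str,
              lang: str = "en",
              min_lang_prob: float = 0.8,           # assumed threshold
              min_logprob_per_word: float = -10.0,  # assumed threshold (log10)
              ) -> bool:
    """Line-level dedup, language ID, and n-gram LM quality filter."""
    line = line.strip()
    if not line:
        return False
    # 1) Deduplicate at the line level via a hash of the normalized line.
    h = hashlib.sha1(line.lower().encode("utf-8")).digest()
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    # 2) Language identification with the fastText linear classifier.
    labels, probs = lid.predict(line)
    if labels[0] != f"__label__{lang}" or probs[0] < min_lang_prob:
        return False
    # 3) Quality filter: drop lines the n-gram LM scores as too unlikely,
    #    normalizing the log-probability by line length.
    n_words = max(len(line.split()), 1)
    return lm.score(line) / n_words >= min_logprob_per_word

# e.g. filter one extracted page (hypothetical file name):
kept = [l for l in open("page.txt", encoding="utf-8") if keep_line(l)]
```

At CommonCrawl scale the real pipeline shards this work and replaces the in-memory hash set with on-disk hash partitions, but the per-line logic is the same shape.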