From 5c6c353f8d2a1588155c9b7c3e3f34afaa29cb65 Mon Sep 17 00:00:00 2001
From: Vlad Badoiu <vlad_andrei.badoiu@upb.ro>
Date: Thu, 4 Jan 2024 22:46:40 +0200
Subject: [PATCH] Add common datasets table

---
 doc/datasets.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 doc/datasets.md

diff --git a/doc/datasets.md b/doc/datasets.md
new file mode 100644
index 0000000..e69970e
--- /dev/null
+++ b/doc/datasets.md
@@ -0,0 +1,17 @@
+## Datasets in LLMs
+
+### Natural Language
+
+| Name | Size | License | Used By | Source Paper | Language | Link | Obs |
+|------|------|---------|---------|--------------|----------|------|-----|
+| Wikipedia | 80 GB | cc-by-sa-3.0 | Llama | | bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk | [huggingface](https://huggingface.co/datasets/wikipedia) | |
+| falcon-refinedweb | 1.2 TB | ODC-By 1.0 | Falcon | The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only | en | [huggingface](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | |
+| SlimPajama | 1 TB+ | Apache 2.0 | phi-2 | | | [huggingface](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | |
+| Gutenberg | 14 GB | https://www.gutenberg.org/policy/license.html | Llama | | en, de, fr, es, nl, pl, pt | [huggingface](https://huggingface.co/datasets/manu/project_gutenberg) | |
+| Books3 | 800 GB | [copyright infringement](https://news.ycombinator.com/item?id=37685313) | Llama | The Pile: An 800GB Dataset of Diverse Text for Language Modeling (arXiv:2101.00027) | | | |
+| CommonCrawl | 100 PB+ | Includes copyrighted work; distributed from the US under fair-use claims. Researchers in other countries have used techniques such as shuffling sentences or referencing the CommonCrawl dataset to work around copyright law in other legal jurisdictions. | Llama | | all | https://commoncrawl.org/ | Needs preprocessing. Llama preprocesses five CommonCrawl dumps (2017–2020) with the CCNet pipeline (Wenzek et al., 2020): the data is deduplicated at the line level, language identification with a fastText linear classifier removes non-English pages, and an n-gram language model filters low-quality content. A linear model trained to distinguish pages used as references in Wikipedia vs. randomly sampled pages then discards pages not classified as references. |
+| C4 | 300 GB | ODC-By | Llama | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | en | [huggingface](https://huggingface.co/datasets/c4) | Cleaned-up version of the April 2019 CommonCrawl snapshot |
+| ArXiv | 92 GB+ | | Llama | | | [axcell](https://github.com/paperswithcode/axcell) | |
+| TinyStories | 2 GB | CDLA-Sharing-1.0 | | TinyStories: How Small Can Language Models Be and Still Speak Coherent English? | en | [huggingface](https://huggingface.co/datasets/roneneldan/TinyStories) | |
+| IMDB Reviews | 100 MB | | | Learning Word Vectors for Sentiment Analysis | en | | |
+| TedTalks | 2 GB | cc-by-nc-nd-4.0 | | | | | |
\ No newline at end of file
-- 
GitLab