diff --git a/doc/datasets.md b/doc/datasets.md
new file mode 100644
index 0000000000000000000000000000000000000000..e69970ee5f0b433b97f43a95803b6759a075122c
--- /dev/null
+++ b/doc/datasets.md
@@ -0,0 +1,66 @@
+## Datasets in LLMs
+
+### Natural Language
+
+| Name              | Size    | License | Used-By | Source Paper | Language | Link | Obs |
+|-------------------|---------|---------|---------|--------------|----------|------|-----|
+| Wikipedia         | 80 GB   | cc-by-sa-3.0 | LLaMA | | bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk | [huggingface](https://huggingface.co/datasets/wikipedia) | |
+| falcon-refinedweb | 1.2 TB  | ODC-By 1.0 | Falcon | The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only | en | [huggingface](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | |
+| SlimPajama        | 1 TB+   | Apache 2.0 | phi-2 | | en | [huggingface](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | Cleaned, deduplicated version of RedPajama (627B tokens) |
+| Gutenberg         | 14 GB   | [Project Gutenberg License](https://www.gutenberg.org/policy/license.html) | LLaMA | | en, de, fr, es, nl, pl, pt | [huggingface](https://huggingface.co/datasets/manu/project_gutenberg) | |
+| Books3            | 800 GB  | [copyright infringement](https://news.ycombinator.com/item?id=37685313) | LLaMA | The Pile: An 800GB Dataset of Diverse Text for Language Modeling (arXiv:2101.00027) | | | |
+| CommonCrawl       | 100 PB+ | Includes copyrighted work, distributed from the US under fair-use claims; researchers in other jurisdictions work around copyright law with techniques such as shuffling sentences or referencing rather than redistributing the crawl | LLaMA | | all | [commoncrawl.org](https://commoncrawl.org/) | Needs preprocessing. LLaMA preprocessed five dumps (2017-2020) with the CCNet pipeline (Wenzek et al., 2020): line-level deduplication, fastText language identification to drop non-English pages, n-gram LM filtering of low-quality content, and a linear classifier that keeps pages resembling Wikipedia references (see the sketch below the table) |
+| C4                | 300 GB  | ODC-By | LLaMA | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | en | [huggingface](https://huggingface.co/datasets/c4) | Cleaned version of the April 2019 CommonCrawl snapshot |
+| ArXiv             | 92 GB+  | | LLaMA | | | [github](https://github.com/paperswithcode/axcell) | |
+| TinyStories       | 2 GB    | CDLA-Sharing-1.0 | | TinyStories: How Small Can Language Models Be and Still Speak Coherent English? | en | [huggingface](https://huggingface.co/datasets/roneneldan/TinyStories) | Synthetic short stories generated with GPT-3.5 and GPT-4 |
+| IMDB Reviews      | 100 MB  | | | Learning Word Vectors for Sentiment Analysis | en | [huggingface](https://huggingface.co/datasets/imdb) | 50,000 movie reviews labeled for sentiment |
+| TedTalks          | 2 GB    | cc-by-nc-nd-4.0 | | | | | |
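+
+Most entries above are hosted on the Hugging Face Hub, so they can be pulled with the `datasets` library. A minimal sketch, assuming the `datasets` package is installed and the dataset ids in the Link column are still current (ids occasionally move behind an organization namespace):
+
+```python
+from datasets import load_dataset
+
+# TinyStories is small enough to download outright; the pretraining
+# corpora in this table typically expose only a "train" split.
+stories = load_dataset("roneneldan/TinyStories", split="train")
+print(stories[0]["text"])
+
+# Multi-terabyte sets such as falcon-refinedweb are better streamed,
+# which avoids materializing the full corpus on disk.
+refined = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
+print(next(iter(refined))["content"])
+```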
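+
+The CommonCrawl Obs column above summarizes the CCNet-style cleanup applied for LLaMA. Below is a minimal sketch of just two of those steps, line-level deduplication and fastText language identification; the n-gram LM quality filter and the Wikipedia-reference classifier are omitted, and `lid.176.bin` assumes fastText's published language-ID model has been downloaded locally.
+
+```python
+import hashlib
+
+import fasttext  # pip install fasttext; lid.176.bin from https://fasttext.cc
+
+# fastText's published 176-language identification model (assumed to be
+# downloaded next to this script).
+lang_model = fasttext.load_model("lid.176.bin")
+
+seen_hashes = set()  # hashes of lines already kept, for corpus-wide dedup
+
+def keep_document(text: str, min_lang_score: float = 0.8) -> str | None:
+    """Return the cleaned page, or None if it should be discarded."""
+    # Deduplicate at the line level: drop any line seen in an earlier page.
+    kept = []
+    for line in text.splitlines():
+        h = hashlib.sha1(line.strip().lower().encode("utf-8")).digest()
+        if h not in seen_hashes:
+            seen_hashes.add(h)
+            kept.append(line)
+    if not kept:
+        return None
+    deduped = "\n".join(kept)
+    # Language identification with the fastText linear classifier;
+    # keep only pages confidently labeled as English.
+    labels, scores = lang_model.predict(deduped.replace("\n", " "))
+    if labels[0] != "__label__en" or scores[0] < min_lang_score:
+        return None
+    return deduped
+```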