Add gradient checkpointing papers

9c126a3a · Alexandru-Mihai GHERGHESCU · db8f2c94 · 9c126a3a
Unverified Commit 9c126a3a authored 1 year ago by Alexandru-Mihai GHERGHESCU
--- a/doc/intro2.md
+++ b/doc/intro2.md
@@ -29,6 +29,13 @@ useful for someone to get up-to-date with the NLP research of today.
 - ["Effective Approaches to Attention-based Neural Machine Translation" (Luong
  et al. - sep. 2015)](https://arxiv.org/abs/1508.04025) - one of the other
  Attention papers
+- ["Training Deep Nets with Sublinear Memory Cost" (Chen et al. - apr.
+  2016)](https://arxiv.org/abs/1604.06174) - introduces gradient checkpointing,
+  cutting activation memory to O(sqrt(n)) for an n-layer network at the cost of
+  roughly one extra forward pass
+- ["Memory-Efficient Backpropagation Through Time" (Gruslys et al. - jun.
+  2016)](https://arxiv.org/abs/1606.03401) - uses dynamic programming to decide
+  which RNN states to checkpoint under a fixed memory budget
 - ["Layer Normalization" (Hinton et al. - jul.
  2016)](https://arxiv.org/abs/1607.06450) - the Layer Normalization paper, used
  by the original Transformer architecture