Out-of-Memory from context length caused by the KV cache?
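Aside from the model weights, we also use 2 * [bytes per element] * [layers] * [heads] * [dimensions per head] * [context length (num_tokens)] bytes for the KV cache. The factor of 2 accounts for storing both keys and values, so the cache grows linearly with context length.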
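To get a feel for the numbers, here is a minimal sketch of that formula in Python. It assumes a batch size of 1 and one key/value head per attention head (no grouped-query attention). The function name `kv_cache_bytes` and the example configuration (32 layers, 32 heads, 128 dimensions per head, fp16) are illustrative assumptions, not taken from any particular model or library.

```python
def kv_cache_bytes(bytes_per_element: int,
                   layers: int,
                   heads: int,
                   dims_per_head: int,
                   num_tokens: int) -> int:
    """Estimate KV cache size in bytes for a single sequence.

    The leading 2 covers the two cached tensors (keys and values).
    Assumes one K/V head per attention head and batch size 1.
    """
    return 2 * bytes_per_element * layers * heads * dims_per_head * num_tokens


# Hypothetical configuration: fp16 (2 bytes), 32 layers, 32 heads, 128 dims per head.
per_token = kv_cache_bytes(2, 32, 32, 128, 1)
total = kv_cache_bytes(2, 32, 32, 128, 2048)

print(f"per token: {per_token / 2**20:.2f} MiB")    # per token: 0.50 MiB
print(f"2048 tokens: {total / 2**30:.2f} GiB")      # 2048 tokens: 1.00 GiB
```

Under these assumptions each extra token costs about 0.5 MiB, so doubling the context length doubles the cache, which can be enough to push an otherwise fitting model out of memory.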
Related article.