Feature: Tensor parallel with FSDP
Tensor parallelism is needed when FSDP alone cannot materialize a single layer (and its activations) in the memory of one GPU. This should only become a problem at very large context sizes (>16K tokens) or on very large models (>100B parameters). Combining tensor parallelism with FSDP should effectively enable scaling to any model size / context length combination. Of course, tensor parallelism incurs significant communication cost, and the feature should probably include a report of how much communication it adds.
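To make the "significant communication cost" claim concrete, here is a minimal sketch of how such a report could estimate the extra traffic. It assumes Megatron-style tensor parallelism (two all-reduces per transformer layer in the forward pass and two in the backward pass, each over an activation tensor of shape batch × seq × hidden) and the standard ring all-reduce volume of 2·(t−1)/t × message size per GPU. The function name and the 4-all-reduce assumption are illustrative, not part of any existing API:

```python
def tp_allreduce_bytes_per_layer(batch: int, seq_len: int, hidden: int,
                                 tp_degree: int, dtype_bytes: int = 2) -> float:
    """Rough per-GPU bytes moved by tensor-parallel all-reduces for one
    transformer layer per training step.

    Assumes Megatron-style TP: 2 all-reduces in forward + 2 in backward,
    each over a (batch, seq_len, hidden) activation tensor, using a ring
    all-reduce that moves 2*(t-1)/t times the message size per GPU.
    """
    if tp_degree <= 1:
        return 0.0  # no tensor parallelism, no extra traffic
    message_bytes = batch * seq_len * hidden * dtype_bytes
    per_allreduce = 2 * (tp_degree - 1) / tp_degree * message_bytes
    return 4 * per_allreduce


# Example: batch 1, 16K context, hidden 8192, bf16, TP degree 8
# comes out to ~1.75 GiB of all-reduce traffic per layer per step.
print(tp_allreduce_bytes_per_layer(1, 16_384, 8_192, 8) / 2**30)
```

Multiplying by the layer count and dividing by interconnect bandwidth gives a first-order estimate of the communication overhead per step, which is the kind of number the report should surface.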