Feature: Non-blocking CPU->GPU transfer

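It seems that CPU->GPU transfers take a while, but they can be issued asynchronously with data.to(device, non_blocking=True) and overlapped with other operations. The copy is only truly asynchronous when the source tensor lives in pinned (page-locked) host memory; otherwise the call still blocks. Similarly, the cost of reading mmap'ed data from disk can perhaps be hidden behind computation, possibly by letting the PyTorch DataLoader load batches in worker processes.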
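
A minimal sketch of what this could look like, assuming a standard DataLoader pipeline. The helper names (make_loader, sum_on_device) and the toy tensors are hypothetical and only for illustration; the PyTorch calls themselves (pin_memory, num_workers, non_blocking) are existing options. The example falls back to the CPU when no GPU is present, in which case non_blocking has no effect.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(data, labels, device, batch_size=4):
    # Hypothetical helper. Pinned (page-locked) host memory is what allows the
    # host-to-device copy to run asynchronously, so only request it for CUDA.
    return DataLoader(
        TensorDataset(data, labels),
        batch_size=batch_size,
        pin_memory=(device.type == "cuda"),
        num_workers=0,  # raise this to prepare batches in background workers
    )


def sum_on_device(loader, device):
    # Hypothetical stand-in for a training or inference loop.
    total = torch.zeros((), device=device)
    for x, y in loader:
        # With a pinned source, the copy is queued on the current CUDA stream
        # and the call does not wait for it. On CPU this is a plain no-op move.
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        # Kernels queued on the same stream run after the copy completes,
        # so the result is correct without an explicit synchronize.
        total += (x.sum(dim=1) * y).sum()
    return total.item()  # .item() waits for the GPU before returning


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data = torch.ones(8, 3)
labels = torch.arange(8, dtype=torch.float32)
loader = make_loader(data, labels, device)
result = sum_on_device(loader, device)
```

Whether this actually hides the transfer depends on there being enough independent work to overlap with it, so it would need to be profiled on the real workload.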