Load balancing
Multiple independent model instances, one per GPU, serve concurrent users; each request stays on a single GPU, so there is no inter-GPU traffic and no PCIe bottleneck. Six GPUs means six simultaneous workers.
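A minimal sketch of the dispatch pattern, with round-robin assignment across per-GPU workers. The names here (`RoundRobinBalancer`, the worker callables) are illustrative, not LocoConvoy's actual API:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand each request to the next independent model instance in turn."""

    def __init__(self, workers):
        self._workers = cycle(workers)

    def dispatch(self, prompt):
        worker = next(self._workers)
        return worker(prompt)

# Each "worker" stands in for a full model pinned to its own GPU;
# no tensors ever cross the PCIe bus between them.
workers = [lambda p, i=i: f"gpu{i}:{p}" for i in range(6)]
lb = RoundRobinBalancer(workers)
print([lb.dispatch("hi") for _ in range(3)])  # → ['gpu0:hi', 'gpu1:hi', 'gpu2:hi']
```

Throughput scales with GPU count because the instances share nothing; the trade-off is that each GPU must hold a complete copy of the model.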
Mixture of agents
Different specialist models collaborate on a single response, each running on its own GPU — without competing for the same VRAM pool.
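A rough sketch of the pattern, assuming the simplest aggregation scheme (specialists answer in parallel, then a final step fuses their drafts). The specialist names and the join-based aggregator are placeholders for real models:

```python
from concurrent.futures import ThreadPoolExecutor

def make_specialist(name):
    # Stand-in for a model instance pinned to one GPU.
    return lambda prompt: f"{name} view of {prompt!r}"

specialists = [make_specialist(n) for n in ("code", "math", "prose")]

def aggregate(drafts):
    # A real aggregator would typically be another model pass;
    # here we just concatenate the specialist drafts.
    return " | ".join(drafts)

def mixture_of_agents(prompt):
    # Specialists run concurrently since each has its own GPU
    # and its own VRAM; nothing is shared between them.
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        drafts = list(pool.map(lambda s: s(prompt), specialists))
    return aggregate(drafts)

print(mixture_of_agents("sort a list"))
```

Because the specialists occupy separate VRAM pools, the slowest model, not the sum of all models, sets the latency of each round.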
Speculative decoding
A small draft model proposes tokens; a larger model verifies them. Trades inter-GPU bandwidth for faster generation by reducing full-model forward passes.
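A toy greedy version of the verify loop: the draft model proposes k tokens, the target accepts the longest agreeing prefix and emits one correction on the first mismatch. Both "models" are deterministic stand-ins, and a real implementation verifies all k positions in a single target forward pass (that single pass is what buys the speedup):

```python
def draft_model(ctx, k):
    # Cheap proposer: emits k tokens from a fixed arithmetic pattern.
    return [(ctx[-1] + 1 + i) % 7 for i in range(k)]

def target_model(ctx):
    # The "large" model's greedy next token for a context.
    return (ctx[-1] + 1) % 5

def speculative_step(ctx, k=4):
    # Accept draft tokens while the target agrees; on the first
    # disagreement, take the target's token and stop. Every accepted
    # draft token is a full-model forward pass saved.
    proposed = draft_model(ctx, k)
    accepted = []
    for tok in proposed:
        if target_model(ctx + accepted) == tok:
            accepted.append(tok)
        else:
            accepted.append(target_model(ctx + accepted))
            break
    return accepted

print(speculative_step([3]))  # → [4, 0]: one draft token accepted, one corrected
```

When draft and target agree often, one verification pass yields several tokens; when they diverge, the step degrades to ordinary one-token decoding, never worse in output quality.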
PCIe reality check
Datacenter multi-GPU systems link cards with NVLink at hundreds of GB/s. Consumer PCIe (4.0 x16) tops out at ~32 GB/s. LocoConvoy measures what that penalty actually costs, and when it doesn't matter.