Tiered Multi-GPU Inference: VRAM Tier vs Architecture Generation
An experiment to isolate two variables that existing multi-GPU benchmarks conflate: VRAM capacity and architecture generation (specifically, Tensor Core availability). Most consumer multi-GPU work assumes matched RTX cards. This experiment uses matched GTX cards as a control group to measure what Tensor Cores actually contribute to inference throughput when VRAM is held roughly constant.
Research Questions
Section titled “Research Questions”Primary: At what VRAM tier does PCIe-bandwidth-limited multi-GPU inference become viable for practical use (defined as >3 tok/s on a quantised model)?
Secondary: What is the measurable contribution of Tensor Cores to inference throughput when VRAM capacity is held constant?
Tertiary: Is there a crossover point where adding more low-generation VRAM outperforms fewer high-generation VRAM slots?
Why This Matters
Section titled “Why This Matters”The existing multi-GPU pooling experiment tests matched RTX 2060 Supers and RTX 3060s — same architecture generation, scaling VRAM by adding cards. That design answers “does pooling work?” but cannot answer “does architecture matter?”
Tortuga’s GTX fleet provides the missing control group. The GTX 1060 6 GB and RTX 2060 Super 8 GB share enough VRAM to be roughly comparable, but differ in architecture generation and Tensor Core availability. Running the same scaling experiment on both tracks — one GTX, one RTX — isolates the architecture variable.
The community assumption is that Tensor Cores accelerate inference. For compute-bound workloads, they do. But LLM inference is memory-bandwidth-bound, not compute-bound. The GTX 1060 6 GB has 192 GB/s bandwidth; the RTX 2060 Super has 448 GB/s. If the RTX advantage is primarily bandwidth rather than Tensor Cores, that changes the purchasing advice for budget builders: older high-bandwidth cards may be better buys than newer low-bandwidth ones.
Experimental Design
Section titled “Experimental Design”Track A — GTX (No Tensor Cores)
Section titled “Track A — GTX (No Tensor Cores)”All cards from Tortuga. Pascal architecture, CUDA cores only.
| Config | Cards | Total VRAM | Per-Card BW |
|---|---|---|---|
| A1 | 1x GTX 1060 6 GB | 6 GB | 192 GB/s |
| A2 | 2x GTX 1060 6 GB | 12 GB | 192 GB/s each |
| A3 | 3x GTX 1060 6 GB | 18 GB | 192 GB/s each |
Tortuga has three GTX 1060 6 GB cards. All configurations are available without additional acquisition.
Track B — RTX (Tensor Cores)
Section titled “Track B — RTX (Tensor Cores)”All cards from Colmena. Turing architecture, Tensor Core enabled.
| Config | Cards | Total VRAM | Per-Card BW |
|---|---|---|---|
| B1 | 1x RTX 2060 Super 8 GB | 8 GB | 448 GB/s |
| B2 | 2x RTX 2060 Super 8 GB | 16 GB | 448 GB/s each |
| B3 | 3x RTX 2060 Super 8 GB | 24 GB | 448 GB/s each |
Controls
Section titled “Controls”| Config | Card | Total VRAM | Notes |
|---|---|---|---|
| C1 | 1x RTX 3090 24 GB | 24 GB | Single-card control for B3 (planned acquisition) |
| C2 | 1x GTX Titan X 12 GB | 12 GB | Single-card control for A2 (Tortuga) |
C1 compares 3x RTX 2060 Super (24 GB pooled, 448 GB/s per card) against RTX 3090 (24 GB monolithic, 936 GB/s). C2 compares 2x GTX 1060 6 GB (12 GB pooled, 192 GB/s per card) against the Titan X (12 GB monolithic, 336 GB/s). Both test whether pooled commodity cards match a single high-VRAM card at the same total capacity.
Methodology
Section titled “Methodology”- Inference engine: llama.cpp with
--tensor-splitfor multi-GPU layer distribution - Models: Llama 3.1 8B, Llama 3.1 13B (Q4_K_M and Q8_0 quantisations)
- Metrics:
- Tokens/second (prompt processing and generation, separately)
- VRAM utilisation per card
- CPU offload percentage (if any layers spill to system RAM)
- PCIe bandwidth utilisation
- Control variables: Same host machine per track (Colmena for Track B, Tortuga for Track A), same prompt set, same context length, same quantisation per run
- Benchmark harness: LocoBench standard harness for consistent logging and reproducibility
Cross-Track Comparison Points
Section titled “Cross-Track Comparison Points”The most informative comparisons isolate one variable at a time:
| Comparison | What It Tests |
|---|---|
| A1 (6 GB GTX) vs B1 (8 GB RTX) | Architecture + bandwidth difference at similar VRAM |
| A2 (12 GB GTX pooled) vs C2 (12 GB Titan X) | Pooled vs monolithic, no Tensor Cores in either |
| B3 (24 GB RTX pooled) vs C1 (24 GB RTX 3090) | Pooled vs monolithic, Tensor Cores in both |
| A2 (12 GB GTX pooled) vs B1 (8 GB RTX single) | More VRAM (no Tensor Cores) vs less VRAM (Tensor Cores) |
| A3 (18 GB GTX pooled) vs B2 (16 GB RTX pooled) | Similar VRAM, different architecture, both multi-GPU |
Expected Outcomes
Section titled “Expected Outcomes”-
PCIe bottleneck will suppress multi-GPU gains relative to theoretical VRAM scaling. Scaling efficiency will be well below 100%, consistent with the pooling experiment hypothesis.
-
Tensor Core advantage will be most visible at Q8_0. At Q4_K_M, the quantised weights are small enough that memory bandwidth dominates. At Q8_0, the larger weight representation may give Tensor Cores more to work with.
-
Bandwidth may matter more than Tensor Cores. The RTX 2060 Super’s 448 GB/s vs the GTX 1060’s 192 GB/s is a 2.3x bandwidth advantage. If Track B outperforms Track A by roughly that ratio regardless of quantisation level, Tensor Cores are not the differentiator — bandwidth is.
-
There may be a “good enough” threshold. 18 GB of GTX VRAM (A3) at >3 tok/s would mean larger models are accessible on salvaged pre-RTX hardware. Whether that threshold is reachable depends on the PCIe penalty at three-card configurations.
Hardware Availability
Section titled “Hardware Availability”Both tracks are ready to run. Colmena has three RTX 2060 Supers installed. Tortuga has three GTX 1060 6 GB cards. The Titan X control (C2) is already in Tortuga. The RTX 3090 control (C1) is a planned Colmena acquisition.
Significance
Section titled “Significance”This experiment documents a reproducible, low-cost pathway for running larger models on salvaged consumer hardware. The GTX track is especially relevant: if pre-RTX cards scale meaningfully across PCIe, a substantial pool of secondhand Maxwell and Pascal hardware becomes viable for multi-GPU inference. That finding — positive or negative — is directly useful to low-resource AI research contexts and institutions considering on-premise inference without cloud spend.
The cross-architecture comparison provides baseline data that does not exist in the literature. Consumer multi-GPU benchmarks either assume matched RTX hardware or test single cards. Nobody has systematically compared GTX scaling against RTX scaling on the same workloads.
Part of the LocoConvoy project. Related: Multi-GPU Inference Pooling, LocoBench, LocoLab.