
Multi-GPU AI Inference: A Practical Guide

Local LLM Hardware, VRAM Pooling, and the Cost-Effectiveness Question


1. Why Multiple GPUs for Local LLM Inference?


The single biggest constraint in running large language models locally is VRAM (Video RAM). A model’s minimum VRAM requirement is roughly:

model parameters x bytes per parameter

So a 7B parameter model at 4-bit quantization needs roughly 4 GB, a 13B model needs ~7-8 GB, a 34B model needs ~20 GB, and a 70B model needs ~40 GB. Most consumer GPUs top out at 8-24 GB, which puts anything above 13B out of reach on a single card.
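A quick sketch of that arithmetic (the ~15% overhead factor is a rough rule of thumb for KV cache and activations, not an exact figure):

```python
# Rough VRAM estimate: parameters x bytes-per-parameter, plus ~15% overhead
# for KV cache and activations (a rule of thumb, not an exact figure).
def vram_gb(params_billions: float, bits_per_weight: float, overhead: float = 0.15) -> float:
    bytes_total = params_billions * 1e9 * (bits_per_weight / 8)
    return bytes_total * (1 + overhead) / 1e9

print(f"7B at 4-bit:  {vram_gb(7, 4):.1f} GB")   # roughly 4 GB
print(f"70B at 4-bit: {vram_gb(70, 4):.1f} GB")  # roughly 40 GB
```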

Multi-GPU setups solve this in one of two ways:

  • Running bigger models than any single card could hold (VRAM pooling)
  • Running faster inference by parallelising the compute across cards (though gains are more nuanced; see Section 2.2)

For most home lab and research users, the first goal is the dominant reason to go multi-GPU.


2. How Inference Engines Split Models Across GPUs


The UI layer (Open WebUI, LM Studio, etc.) does not do the heavy lifting. The inference engine running underneath manages the GPU splits. There are two main strategies:

2.1 Pipeline Parallelism (Layer Offloading)


Used by: llama.cpp, Ollama (via llama.cpp under the hood), LM Studio, Jan

The model’s layers are divided sequentially across GPUs like an assembly line:

GPU 0: layers 1-12 -> GPU 1: layers 13-24 -> GPU 2: layers 25-36

Each token generated travels down the pipeline from card to card via your PCIe bus. Only one GPU is active at a time per token pass.

Implications:

  • Identical GPUs: each stage takes the same time, no bottleneck between stages
  • Mixed GPUs: total throughput is limited by the slowest card’s bandwidth
  • PCIe slot speed matters — a card in a x4 slot starves even if the card itself is fast
  • Overall throughput does not scale linearly with GPU count; you’re running a larger model, not a faster one
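A hypothetical sketch of how such an assignment might be computed, weighting each card's share of layers by its VRAM (real engines use more sophisticated heuristics that also account for the KV cache):

```python
# Hypothetical sketch: assign contiguous layer ranges to GPUs under pipeline
# parallelism, weighted by each card's VRAM capacity.
def split_layers(n_layers: int, vram_per_gpu: list[int]) -> list[range]:
    total = sum(vram_per_gpu)
    ranges, start = [], 0
    for i, v in enumerate(vram_per_gpu):
        # the last GPU takes the remainder to avoid rounding gaps
        count = n_layers - start if i == len(vram_per_gpu) - 1 else round(n_layers * v / total)
        ranges.append(range(start, start + count))
        start += count
    return ranges

# Three matched 8 GB cards, 36 layers: 12 layers per stage
print(split_layers(36, [8, 8, 8]))
```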

2.2 Tensor Parallelism

Used by: ExLlamaV2, vLLM, TensorRT-LLM

Every layer is sliced in half (or thirds, etc.) and each GPU computes its slice simultaneously:

Layer N:    GPU 0 computes left half  |  GPU 1 computes right half   (simultaneously)
            |--------- synchronise results ---------|
Layer N+1:  repeat

Implications:

  • GPUs work in parallel — genuine throughput gains are possible
  • Synchronisation overhead between cards adds latency — requires fast interconnects (NVLink ideal, PCIe acceptable for same-generation cards)
  • Mixed-speed GPUs still bottleneck at the slower card per synchronisation step
  • Higher tokens/sec ceiling than pipeline parallelism on matched hardware
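The column-split idea can be sketched with NumPy arrays standing in for the two GPUs. This is a conceptual illustration of why the results must be synchronised each layer, not how any engine is actually implemented:

```python
import numpy as np

# Conceptual sketch of tensor parallelism: a layer's weight matrix is split
# column-wise across "GPUs" (here just arrays); each computes its slice of
# the output, then the slices are concatenated (the synchronisation step).
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))        # one token's activations
W = rng.standard_normal((512, 1024))     # the full layer weight

W0, W1 = np.split(W, 2, axis=1)          # "GPU 0" and "GPU 1" each hold half
y0 = x @ W0                              # in reality these run simultaneously
y1 = x @ W1
y = np.concatenate([y0, y1], axis=1)     # all-gather / synchronise

assert np.allclose(y, x @ W)             # identical to the single-GPU result
```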

Choosing between the two strategies:

| Scenario | Recommended Approach |
| --- | --- |
| Just need the model to fit in VRAM | Pipeline parallelism (Ollama/llama.cpp), zero config |
| Matched GPUs, want maximum t/s | Tensor parallelism via ExLlamaV2 |
| Mixed GPU generations | Pipeline parallelism; tensor parallelism's sync overhead punishes mixed speeds more |
| Fine-tuning | Neither; use a single fast GPU with bf16 support |

3. The RTX 2060 Super: An Underrated Inference Card


The RTX 2060 Super is frequently overlooked in LLM discussions because it is an older generation. This is a mistake for inference workloads.

LLM inference is a memory-bandwidth-bound task. The rate at which weights can be streamed from VRAM into the compute units determines tokens-per-second far more than raw shader throughput.

| GPU | VRAM | Memory Bus | Bandwidth | Gen |
| --- | --- | --- | --- | --- |
| RTX 2060 Super | 8 GB | 256-bit | 448 GB/s | Turing |
| RTX 3060 12 GB | 12 GB | 192-bit | 360 GB/s | Ampere |
| RTX 3060 Ti | 8 GB | 256-bit | 448 GB/s | Ampere |
| RTX 3070 | 8 GB | 256-bit | 448 GB/s | Ampere |
| RTX 4060 | 8 GB | 128-bit | 272 GB/s | Ada |
| RTX 4060 Ti 16 GB | 16 GB | 128-bit | 288 GB/s | Ada |
| RTX 4070 | 12 GB | 192-bit | 504 GB/s | Ada |
| RTX 3090 | 24 GB | 384-bit | 936 GB/s | Ampere |
| RTX 4090 | 24 GB | 384-bit | 1,008 GB/s | Ada |

The 2060 Super’s 256-bit bus gives it bandwidth parity with the 3060 Ti and 3070 — cards that were released years later at much higher prices. The 4060 and 4060 Ti are significantly slower for inference despite being newer.
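This bandwidth-bound behaviour yields a useful back-of-envelope bound: every generated token must stream the full set of weights from VRAM once, so tokens per second cannot exceed bandwidth divided by model size. A small sketch (figures from the table above; the bound is an idealised ceiling, and measured rates land well below it):

```python
# Idealised decoding ceiling for memory-bandwidth-bound inference:
# tokens/sec <= bandwidth / model size, since each token streams all weights.
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

print(max_tokens_per_sec(448, 20))  # ~20 GB 34B Q4 model, 2060 Super stage
print(max_tokens_per_sec(936, 20))  # the same model on an RTX 3090
```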

3.2 Three RTX 2060 Supers: What You Actually Get


| Metric | Value |
| --- | --- |
| Pooled VRAM | 24 GB |
| Per-card bandwidth | 448 GB/s |
| Effective pipeline throughput | ~448 GB/s (sequential) |
| Model size ceiling | ~20-22 GB usable (leaving headroom for KV cache) |
| Comparable single-card VRAM | RTX 3090 / RTX 4090 |

With 3x matched cards, the “weakest link” penalty disappears entirely — every stage in the pipeline runs at the same rate.

These are approximate single-user inference figures for pipeline parallelism:

| Model | Quantisation | VRAM Used | Estimated t/s |
| --- | --- | --- | --- |
| Llama 3 8B | Q4_K_M | ~5 GB | 35-50 t/s |
| Mistral 7B | Q4_K_M | ~4.5 GB | 35-55 t/s |
| Qwen 2.5 14B | Q4_K_M | ~9 GB | 20-30 t/s |
| Llama 3 70B | Q2_K | ~26 GB | Too large |
| Llama 3 70B | IQ1_M | ~18 GB | 8-12 t/s |
| Mistral 22B | Q4_K_M | ~13 GB | 14-20 t/s |
| Command-R 35B | Q3_K_M | ~15 GB | 12-18 t/s |
| Yi-34B | Q4_K_M | ~20 GB | 10-15 t/s |

Verdict: For a single-user research or teaching context, 10-20 t/s on a high-quality 34B model is entirely usable. Conversational feel begins around 8 t/s; anything above 15 t/s is comfortable.

A mining-style chassis (e.g., WEIHO 6-GPU) with six RTX 2060 Supers creates an interesting decision point:

| Configuration | Cost (AUD, approx.) | Total VRAM | Architecture |
| --- | --- | --- | --- |
| 6x RTX 2060 Super | $600-700 secondhand | 48 GB | 6 independent 8 GB cards |
| WEIHO chassis + PSU | ~$190 | n/a | PCIe x1 risers |
| Total | ~$800-900 | 48 GB | |

This is less than a single RTX 3090 (24 GB, ~$800-1000 AUD). But the 48 GB figure is misleading without understanding two very different use modes:

Pooled VRAM (one large model split across cards): Using llama.cpp --tensor-split 1,1,1,1,1,1, a single ~40 GB model can theoretically run across all six cards. However, mining chassis use PCIe x1 riser cables, which dramatically limit inter-GPU bandwidth compared to x8 or x16 slots. This makes layer splitting functional but slow — each pipeline stage waits on the narrow PCIe x1 link to pass activations to the next card. For a single large model, a single RTX 3090 at 936 GB/s bandwidth will be significantly faster than six 2060 Supers connected by risers.

Independent workers (six separate models): The compelling use case is running six independent 7B model instances simultaneously — one per card, no cross-GPU communication needed. This is ideal for:

  • Load balancing across multiple concurrent users
  • Mixture-of-Agents (MoA) architectures where multiple models contribute to a response
  • Batch evaluation (e.g., running LocoBench across six models in parallel)
  • A/B testing different models or quantisation levels side by side

Six concurrent 7B inference workers for under $1000 AUD is remarkable value. The PCIe x1 riser limitation is irrelevant when each card runs independently.

Key insight: The value of many cheap GPUs depends entirely on whether you need one big model or many small ones. For pooled VRAM, fewer cards in proper PCIe slots wins. For concurrent independent inference, more cards wins regardless of slot bandwidth.
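In the independent-worker mode, the setup can be as simple as one server process per card. A hypothetical sketch using Ollama (the port numbering is an arbitrary choice; OLLAMA_HOST sets each instance's bind address):

```shell
# Hypothetical layout: one Ollama server per card, each pinned to a single
# GPU via CUDA_VISIBLE_DEVICES and listening on its own port.
for i in 0 1 2 3 4 5; do
  CUDA_VISIBLE_DEVICES=$i OLLAMA_HOST=127.0.0.1:$((11434 + i)) ollama serve &
done
# A reverse proxy or a simple round-robin client can then balance requests
# across the six endpoints.
```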

The limitations of the 2060 Super route are worth stating plainly:

  • No native bf16 support: the Turing architecture pre-dates bfloat16. This effectively rules out fine-tuning modern LLMs on these cards (QLoRA via fp16 is possible but slower and less stable).
  • Older CUDA cores: raw TFLOP compute is well behind Ampere/Ada for tasks that are compute-bound (batch inference, training).
  • Thermal management: three or more cards running sustained inference in a single chassis need airflow planning. Mining chassis have open-air designs that help, but power draw scales linearly (~75W per card under load).
  • PCIe riser bandwidth: mining-style x1 risers work fine for independent inference but are a severe bottleneck for layer-split or tensor-split workloads. For pooled VRAM, cards in proper x8/x16 motherboard slots are strongly preferred.

The main inference engines and their multi-GPU support:

| Tool | Multi-GPU | Config Effort | Format Support | Best For |
| --- | --- | --- | --- | --- |
| Ollama | Layer splitting via llama.cpp | Low | GGUF | Quickest path to working multi-GPU chat |
| llama.cpp CLI | --tensor-split flags | Low (script it) | GGUF | Custom launch scripts, fine-grained layer control |
| ExLlamaV2 | Explicit per-GPU allocation | Medium | EXL2, GPTQ | Maximum t/s on matched Nvidia cards |
| TabbyAPI | ExLlamaV2 backend | Medium | EXL2, GPTQ | Lighter alternative to vLLM for local multi-GPU serving |
| vLLM | Tensor parallelism | Medium-High | GPTQ, AWQ, fp16 | High-throughput production serving, batch inference |
Front-end UIs that sit on top of these engines:

| UI | Deployment | Multi-GPU | Notes |
| --- | --- | --- | --- |
| Open WebUI | Docker / Python | Via Ollama backend | Best self-hosted ChatGPT clone |
| LM Studio | Desktop app | Built-in GPU toggle panel | Most beginner-friendly; visual GPU allocation |
| Text Gen WebUI (Oobabooga) | Python / local web | Per-GPU VRAM allocation | Most granular control; power-user tool |
| AnythingLLM | Desktop or Docker | Via Ollama/LM Studio | Best for RAG / document Q&A |
| Jan | Desktop app | Automatic (llama.cpp) | Clean open-source LM Studio alternative |

For CLI chat (lowest friction):

```shell
# Ollama splits layers across GPUs via llama.cpp under the hood.
# CUDA_VISIBLE_DEVICES must be set on the server process, which owns the GPUs:
CUDA_VISIBLE_DEVICES=0,1,2 ollama serve
# Then, from another shell (~15 GB Q4 model, splits across 3 cards):
ollama run command-r
# To spread layers across all visible GPUs even when the model fits on one:
# OLLAMA_SCHED_SPREAD=1 ollama serve
# To push all layers to GPU, use "PARAMETER num_gpu 999" in a Modelfile.
```

Note: Ollama’s multi-GPU is layer splitting (pipeline parallelism), not true tensor parallelism. It is functionally VRAM pooling — a model larger than one card’s VRAM can run across multiple cards — but with PCIe bandwidth overhead between stages. For benchmarking where you need precise control over the split, use llama.cpp directly.

For maximum performance (ExLlamaV2):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/models/mistral-22b-exl2"
config.prepare()
model = ExLlamaV2(config)
# Explicit split: gpu_split is a list of per-GPU allocations in GB,
# so roughly 8 GB on each of the three cards here.
model.load([8.0, 8.0, 8.0])
```

For a tuned launch script (llama.cpp):

```shell
# -ngl 99: push all layers to GPU
# --tensor-split 1,1,1: equal split across 3 cards
# -c 8192: context window
# -i: interactive / chat mode
./llama-cli \
  -m /models/command-r-q4_k_m.gguf \
  -ngl 99 \
  --tensor-split 1,1,1 \
  -c 8192 \
  --temp 0.7 \
  -i
```

For production-style multi-GPU serving (vLLM):

```shell
# Genuine tensor parallelism: GPUs work in parallel, not sequentially.
# Note: --tensor-parallel-size must evenly divide the model's attention-head
# count, so a 3-way split is not valid for every model.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-22B \
  --tensor-parallel-size 3 \
  --gpu-memory-utilization 0.9
```

vLLM provides an OpenAI-compatible API, so any frontend (Open WebUI, etc.) can connect to it. Higher setup complexity than Ollama but better multi-GPU utilisation for a single large model.
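Because the endpoint speaks the OpenAI wire format, any HTTP client can drive it. A minimal standard-library sketch that builds, but does not send, a request (the URL, port, and model name are assumptions that must match the server launch command):

```python
import json
import urllib.request

# Build a chat-completion request for vLLM's OpenAI-compatible endpoint.
# URL, port, and model name are assumptions; they must match the server.
payload = {
    "model": "mistralai/Mistral-22B",  # must match --model on the server
    "messages": [{"role": "user", "content": "Summarise pipeline parallelism."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, urllib.request.urlopen(req) returns the completion.
```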

For lightweight local serving (TabbyAPI):

TabbyAPI uses ExLlamaV2 under the hood with an OpenAI-compatible API. Lighter than vLLM, good multi-GPU support, designed for single-user local inference. A middle ground between Ollama’s simplicity and vLLM’s production features.

Practical note: These tools don’t conflict — you can have Ollama for quick single-card use, llama.cpp for benchmarking with precise control, and vLLM for multi-GPU serving, all on the same machine.


5. The Apple Silicon Alternative

Apple's M-series chips take a fundamentally different architectural approach. Rather than discrete VRAM on a GPU card, they use unified memory: a single high-bandwidth memory pool shared between the CPU, GPU, and Neural Engine. This has significant implications for LLM inference.

| Chip | Config | Unified RAM | Bandwidth |
| --- | --- | --- | --- |
| M4 (base) | Mac Mini entry | 16-32 GB | 120 GB/s |
| M4 Pro | Mac Mini Pro, MacBook Pro | 24-64 GB | 273 GB/s |
| M4 Max | Mac Studio, MacBook Pro | 36-128 GB | 546 GB/s |
| M4 Ultra | Mac Studio top | 192 GB | 819 GB/s |
| M5 Pro (est.) | Mac Mini/MacBook Pro | 24-64 GB | ~350 GB/s |
| M5 Max (est.) | Mac Studio | 64-128 GB | ~660 GB/s |

Note: M5 specifications are provisional based on Apple’s typical generation-over-generation gains (~20-25% bandwidth improvement). Verify current Apple specs before purchasing.

A critical point often misunderstood: multiple Mac Minis do not natively pool their unified memory the way multi-GPU does on a single PC.

Each Mac Mini is a fully independent computer. To run a model split across two or more Mac Minis, you need distributed inference via tools like:

  • llama.cpp distributed mode (--rpc server mode, experimental)
  • Petals (distributed transformer inference over a local network)
  • Exo (open-source project specifically for Apple Silicon cluster inference)

This introduces network latency between nodes (even on 10GbE), which adds overhead that unified-memory solutions do not have. In practice, a Mac Mini cluster for inference is more complex to operate than a multi-GPU PC, and the per-node memory is not additive without latency penalties.
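As a concrete sketch of what the llama.cpp distributed route involves (the hostnames, ports, and model path below are placeholders, and the RPC feature is experimental, so build options and flags may change between versions):

```shell
# On each worker node (llama.cpp must be built with RPC support):
#   cmake -B build -DGGML_RPC=ON && cmake --build build
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# On the coordinating machine, point llama-cli at the workers:
./build/bin/llama-cli \
  -m /models/llama-3-70b-q4.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -ngl 99
```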

5.3 Side-by-Side: 3x RTX 2060 Super vs. Apple Options

| Configuration | Approx. AUD Cost | Effective “VRAM” | Bandwidth (inference) | Notes |
| --- | --- | --- | --- | --- |
| 3x RTX 2060 Super (secondhand) | $450-700 | 24 GB (pooled) | ~448 GB/s (pipeline) | Needs host PC; power draw ~450W under load |
| Mac Mini M4 (32 GB) | ~$1,800 | 32 GB | 120 GB/s | Quiet, low power (~25W idle), but slow bandwidth |
| Mac Mini M4 Pro (48 GB) | ~$2,300 | 48 GB | 273 GB/s | Better balance; good for 34B models |
| Mac Mini M4 Pro (64 GB) | ~$2,700 | 64 GB | 273 GB/s | Comfortable 70B headroom |
| 2x Mac Mini M4 Pro (48 GB each) | ~$4,600 | 48 GB each (not pooled) | 273 GB/s per node | Distributed only; complex setup |
| Mac Studio M4 Max (128 GB) | ~$4,500+ | 128 GB | 546 GB/s | Serious single-machine performance |
| Mac Studio M4 Ultra (192 GB) | ~$8,000+ | 192 GB | 819 GB/s | Near-datacenter for 70B+ unquantised |

Estimated throughput on a 34B-class model at Q4:

| Platform | Approx. t/s (34B Q4) | Notes |
| --- | --- | --- |
| 3x RTX 2060 Super | 10-15 t/s | Pipeline parallelism; conversational |
| Mac Mini M4 (32 GB) | 8-12 t/s | Bandwidth-limited; fits but slow |
| Mac Mini M4 Pro (48 GB) | 18-25 t/s | Better bandwidth; model fits cleanly |
| Mac Studio M4 Max (128 GB) | 35-55 t/s | Comfortable headroom; fast |

At the 34B model size, the 3x 2060 Super setup is genuinely competitive with a Mac Mini M4 Pro configuration costing 3-4x more — and faster than a base Mac Mini M4.


6.1 Bang-Per-Dollar at Common Price Points


Assuming secondhand Australian pricing:

| Config | AUD Cost | Pooled VRAM | Max Model Size | Inference Speed (est.) |
| --- | --- | --- | --- | --- |
| 2x RTX 2060 Super | $300-450 | 16 GB | ~13B Q4 comfortably | 25-40 t/s on 13B |
| 3x RTX 2060 Super | $450-700 | 24 GB | ~34B Q4 | 10-15 t/s on 34B |
| 4x RTX 2060 Super | $600-950 | 32 GB | ~34B Q4 or 70B IQ2 | 10-14 t/s on 34B |
| RTX 3090 (single) | $800-1,000 | 24 GB | ~34B Q4 | 25-35 t/s on 34B |
| Mac Mini M4 Pro 48 GB | ~$2,300 | 48 GB | ~70B Q4 | 18-25 t/s on 34B |
The 3x RTX 2060 Super configuration is the most cost-effective entry point into 34B-class model inference. Its main trade-off against a single RTX 3090 of the same VRAM capacity is throughput: the 3090’s 936 GB/s bandwidth (vs. the pipeline-limited ~448 GB/s effective rate of the tri-card setup) roughly doubles tokens per second on models that fit entirely on the 3090.

A 4th 2060 Super adds 8 GB of headroom (32 GB total) for a marginal cost, allowing more breathing room for KV cache on long contexts or slightly less aggressive quantisation on 34B models. The throughput gain is minimal since pipeline parallelism adds another stage.
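To put numbers on that KV-cache headroom: for grouped-query-attention models the cache scales as 2 (K and V) x layers x KV heads x head dimension x bytes per element x context length. A sketch using Llama 3 70B's published shape (80 layers, 8 KV heads, head_dim 128) at fp16:

```python
# KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x
# bytes-per-element x context length.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx: int,
                bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / 1e9

# Llama 3 70B (80 layers, 8 KV heads, head_dim 128) at fp16, 8k context:
print(f"{kv_cache_gb(80, 8, 128, 8192):.2f} GB")  # roughly 2.7 GB
```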

6.2 When the Mac Ecosystem Makes More Sense

  • You need more than 32 GB in a single logical device without distributed inference complexity
  • You prioritise power efficiency (a Mac Mini M4 Pro idles at ~8W, a tri-GPU PC at ~150W baseline)
  • You want macOS and the Apple ecosystem for development
  • You are running 70B+ models regularly — this is where Mac Studio M4 Max/Ultra pulls significantly ahead
  • You need silence — a three-GPU workstation is not quiet
The multi-GPU PC route makes more sense when:

  • Budget is the primary constraint: $600 vs. $2,300+ for similar 34B performance
  • You already have a compatible host machine (the GPU cost dominates)
  • You want flexibility — swap cards, add more, run CUDA-specific tools
  • You are comfortable with Linux and CLI tooling
  • You plan to experiment with fine-tuning pipelines eventually (you would add a bf16-capable card for that role)

6.3 Recommendations

Given the existing 3x RTX 2060 Super inventory plus a research focus on cost-effective inference:

  1. Run Ollama as the primary inference backend — zero-config multi-GPU pooling, REST API available for all frontends and custom scripts.

  2. Add ExLlamaV2 as a secondary path for benchmarking — EXL2 quantised models at 3.0-4.0 bpw on 34B give a meaningful t/s improvement over GGUF Q4 at the same VRAM footprint.

  3. Target the 14B-22B sweet spot rather than the absolute 34B ceiling. A Mistral 22B or Qwen2.5 14B at Q6_K will outperform a Yi-34B at Q3 in both quality and speed given your VRAM headroom.

  4. A fourth 2060 Super is a worthwhile addition if available cheaply — 32 GB enables running Llama 3 70B at IQ2_XS (~17 GB) with a decent context window.

  5. PCIe slot planning matters — ensure all three (or four) cards are in x8 or x16 electrical slots. A card in a x4 slot will bottleneck inter-GPU data transfer under pipeline parallelism, effectively wasting bandwidth.

  6. For LocoBench benchmarking — the 2060 Super’s profile (high bandwidth, modest compute, older CUDA) makes it a genuinely interesting reference point for research on cost-optimised inference hardware. It represents a class of GPU that is abundant, cheap, and frequently overlooked in the literature.


7. Summary

Multi-GPU inference is primarily a VRAM expansion strategy, not a speed multiplication strategy. Pipeline parallelism (the default in Ollama and llama.cpp) lets you run models that would not otherwise fit, at roughly the throughput of a single card. Tensor parallelism (ExLlamaV2, vLLM) on matched hardware can deliver genuine speed gains but requires more configuration.

The RTX 2060 Super is an excellent inference card hiding in plain sight — its 256-bit bus and 448 GB/s bandwidth outperform most mid-tier Ampere and Ada cards for single-user LLM workloads. Three of them for under $700 AUD gives 24 GB of pooled VRAM and usable performance on 34B-class models.

Apple Silicon’s unified memory architecture offers a genuinely different value proposition: simpler setup, better per-GB quality at higher VRAM tiers, lower power consumption, but at a significant cost premium. For a research lab budget, the multi-GPU PC approach delivers strong capability per dollar — particularly if the host machine already exists.

The Mac Mini cluster idea (“stacking” M4 or M5 Minis) is often discussed but rarely practical for inference: it requires distributed inference software, introduces network latency between nodes, and the memory pools are not additive in the same seamless way as multi-GPU on a single PCIe bus. A single well-specced Mac Studio M4 Max is a better investment than two Mac Minis for inference purposes.


Report compiled March 2026. GPU pricing reflects approximate AUD secondhand market. Token rate estimates are single-user, single-stream inference benchmarks and will vary by quantisation, context length, prompt complexity, and driver/software version.