Local AI Starter: How to Run a Useful AI Agent on a 2018 Gaming GPU

Paulo Rodrigues · 8 min read

"You need at least 24GB of VRAM for a useful local model" is the most expensive myth in self-hosted AI. It pushes people straight to a cloud subscription — and straight into sending their prompts and documents to someone else's machine.

My daily AI agent does not work that way. It runs a 12-billion-parameter model on an RTX 2080 Ti — an 11GB gaming card from 2018 — entirely on the machine under my desk. No API key. No prompts leaving the building to a third-party model. It is not a science project; it is the tool I actually use to research, read, summarise, and verify.

Here is exactly how it fits, the stack that serves it, and — just as important — what it cannot do.

What "useful" means here

Before the hardware, the honest framing: this is not a chatbot demo. The agent does real work in a loop — it runs a web search, opens the pages, reads them, and writes a synthesis with sources. That is the shape of most day-to-day "AI agent" tasks, and almost none of it needs the most expensive model in the world. It needs a decent model that you control.
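The loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation behind this article: `run_agent`, `Page`, and the three callables (`search`, `fetch`, `synthesize`) are hypothetical names standing in for a search API, a page fetcher, and a call to the local model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    url: str
    text: str


def run_agent(
    question: str,
    search: Callable[[str], list[str]],             # query -> candidate URLs
    fetch: Callable[[str], str],                    # URL -> page text
    synthesize: Callable[[str, list[Page]], str],   # question + pages -> answer
    max_pages: int = 5,
) -> tuple[str, list[str]]:
    """Search, open the pages, read them, then write an answer with sources."""
    pages: list[Page] = []
    for url in search(question)[:max_pages]:
        try:
            pages.append(Page(url, fetch(url)))
        except Exception:
            # A page that fails to load is skipped, so it can never be cited.
            continue
    answer = synthesize(question, pages)
    # Only pages that were actually opened are returned as sources.
    return answer, [p.url for p in pages]
```

Passing the three steps in as callables keeps the loop independent of any particular search provider or model server, which also makes it easy to test with stubs.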

That distinction — capability you rent versus capability you own — is the whole point. So let's make ownership concrete.

The memory budget: how a 12B model fits an 11GB card

The single most useful idea in local AI is quantization-aware training (QAT). A naive 4-bit quantization of a model visibly loses accuracy. A QAT model is trained to survive 4-bit, so it keeps near-full-precision quality at roughly a quarter of the memory. The model I run — Gemma 4 12B QAT — lands at about 10GB of the 11GB card.

That leaves roughly 1GB. Here is what it buys, and why each piece matters:

  • A 96,000-token context window. Enough to feed real documents into the model, not toy prompts. Context lives in the KV cache, and how you quantize that cache decides how much context fits in the leftover memory.
  • A small 0.4B "draft" model for speculative decoding. It predicts several tokens ahead and the big model verifies them in one pass. On my setup this pushes generation to about 93 tokens per second — fast enough for real work, not a demo. The draft model shares the big model's cache, so it costs almost no extra VRAM.
  • Headroom, so the card does not run out of memory mid-task.
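To make the draft-and-verify idea concrete, here is a toy sketch of the greedy variant of speculative decoding. It is an illustration under stated assumptions, not llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the big and small models, each returning the next token for a given sequence, and a real engine would score all drafted positions in one batched forward pass rather than one call per position.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # token sequence -> next token (greedy)


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    n_new: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Return (tokens, number of big-model verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The small model drafts k tokens ahead.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The big model verifies the draft (counted as one pass here).
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target_next(tokens + accepted)
            accepted.append(correct)
            if correct != guess:
                break  # first mismatch: keep the big model's token, drop the rest
        else:
            # Every drafted token matched, so the pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes
```

With greedy verification the output is identical to what the big model would produce alone; the draft only changes how many big-model passes are needed to get there.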

The lesson is not "buy a 2080 Ti." It is that the VRAM number on the box is not the constraint people assume. Quantization, KV-cache settings, and an optional draft model decide what actually fits — and most "you need a bigger GPU" advice ignores all three.
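A rough sketch of why the KV-cache setting matters so much for context length. The formula is the standard one for a plain attention cache (keys plus values, for every layer); the layer and head counts below are placeholders, not the real model's configuration, and the bit widths ignore the small per-block overhead that quantized cache formats carry. Models that use sliding-window or shared-KV layers cache less than this suggests.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bits_per_value: float) -> float:
    """Bytes of KV cache one token occupies: keys + values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bits_per_value / 8


def max_context_tokens(free_bytes: float, per_token_bytes: float) -> int:
    """How many tokens of context fit in the memory left after the weights."""
    return int(free_bytes // per_token_bytes)


# Placeholder architecture, NOT the real model's configuration.
ARCH = dict(n_layers=32, n_kv_heads=4, head_dim=128)
FREE_BYTES = 1e9  # "roughly 1GB" left over, per the text

for name, bits in [("f16", 16), ("8-bit", 8), ("4-bit", 4)]:
    per_token = kv_bytes_per_token(bits_per_value=bits, **ARCH)
    tokens = max_context_tokens(FREE_BYTES, per_token)
    print(f"{name:>5}: {per_token:,.0f} bytes/token -> {tokens:,} tokens")
```

Whatever the architecture, the relationship is linear: halving the bits per cached value roughly doubles the context that fits in the same leftover memory.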

The serving stack: llama.cpp, and why I skipped CUDA

The model is served by llama.cpp using the Vulkan backend — not CUDA. That choice surprises people, so I benchmarked it instead of assuming.

On this Turing card, the results were clear:

  • Token generation (the speed you feel in every chat or agent turn) barely moved: CUDA was about 8% faster at most.
  • Prompt processing (the long-context prefill) is where CUDA wins, roughly 1.5x.

The reason is physics, not software. Token generation is limited by memory bandwidth, not raw compute. My setup already runs near the card's bandwidth ceiling of about 616 GB/s, so a different backend cannot beat the wall. And CUDA carries a tax: there is no official prebuilt binary for my setup, which means installing a multi-gigabyte toolkit and recompiling against it on a near-daily release cadence. The Vulkan path is already accelerated on Turing and survives kernel updates untouched.
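A back-of-envelope version of that wall, assuming every generated token has to stream the full model (about 10GB, per the figure above) out of VRAM once. That is a simplification, and the function name is illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/second when each token must read the weights once."""
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 616  # the card's bandwidth ceiling, from the text
WEIGHTS_GB = 10       # approximate model footprint, from the text

ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, WEIGHTS_GB)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tok/s")  # prints 61.6 tok/s
```

Under that assumption the ceiling is about 62 tokens per second per pass of the big model, whichever backend issues the reads. The only way past it is to get more than one token out of each pass, which is what speculative decoding does; the 93 tokens per second quoted earlier would imply roughly 1.5 tokens accepted per pass.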

If you mostly generate — chat, agents — you are bandwidth-bound, and the default backend is probably fine. Only if you mostly ingest huge documents does prefill speed become worth optimising. Measure your workload before you cargo-cult "install CUDA."
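One way to reason about your own workload before switching backends is to split a turn into prefill time and generation time and apply the two speedups separately. In this sketch the 1.5x and 8% factors and the 93 tokens per second come from this article; the prefill speed and the two example workloads are placeholders to replace with your own measurements, and the function names are illustrative.

```python
def turn_seconds(prompt_tokens: float, gen_tokens: float,
                 prefill_tok_s: float, decode_tok_s: float) -> float:
    """Time for one turn: ingest the prompt, then generate the reply."""
    return prompt_tokens / prefill_tok_s + gen_tokens / decode_tok_s


def backend_speedup(prompt_tokens: float, gen_tokens: float,
                    prefill_tok_s: float, decode_tok_s: float,
                    prefill_gain: float = 1.5, decode_gain: float = 1.08) -> float:
    """End-to-end speedup if prefill and decode get faster by the given factors."""
    before = turn_seconds(prompt_tokens, gen_tokens, prefill_tok_s, decode_tok_s)
    after = turn_seconds(prompt_tokens, gen_tokens,
                         prefill_tok_s * prefill_gain, decode_tok_s * decode_gain)
    return before / after


PREFILL_TOK_S = 1000.0  # placeholder: measure your own prefill speed
DECODE_TOK_S = 93.0     # generation speed quoted in the text

# Hypothetical workloads: (prompt tokens, generated tokens) per turn.
workloads = {"chat/agent turn": (500, 400), "bulk ingest": (60_000, 300)}
for name, (prompt, generated) in workloads.items():
    gain = backend_speedup(prompt, generated, PREFILL_TOK_S, DECODE_TOK_S)
    print(f"{name}: {gain:.2f}x end-to-end")
```

The more of a turn is spent generating, the closer the end-to-end gain sits to the small decode factor; only prompt-heavy workloads approach the 1.5x.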

What it costs — and what you get back

The marginal cost of each question is electricity, not a monthly bill that grows with use. A used gaming card is a one-time cost that pays for itself against any per-token cloud invoice surprisingly fast. (We have written before about the €107 infrastructure behind this whole setup — the philosophy is the same: own the boring parts.)

But the real return is not the money. It is two things the cloud cannot give you:

  • Privacy by construction. Every cloud AI call puts your data — or your client's — on someone else's model. Run the model yourself and your prompts never reach a third party. There is no data-processing agreement to read, because there is no third party.
  • Compliance posture. The EU AI Act's main high-risk obligations land on 2 August 2026. "Where is this processed?" stops being a technical footnote and becomes a question you may have to answer. "On hardware I own, in the EU, with nothing leaving the building" is a strong answer.

The honest limits

A starter guide that only sells the upside is lying. Two limits matter.

Small models fabricate on hard synthesis. An 8-billion-parameter model will drive tools perfectly well and then confidently blend facts when asked to reason across them — one of mine once "calculated" 366 days in a year. The failure is bad reasoning over good data, not missing data, which makes it easy to miss. For multi-step synthesis, a 12B is roughly the floor I trust, and even then I verify.

Reliability comes from the safety nets, not just the model. My agent once repeated the same line 4,262 times and ran the GPU at 100% for over six minutes without stopping. The obvious fix — a repetition penalty — actually made it worse, because those penalties corrupt exactly the tokens you want intact: URLs, version numbers, dates. The real fix was a deterministic guard that watches the output and aborts a runaway loop, plus a check that the model never cites a page it did not actually open. In local AI, the model is half the work. The other half is the guardrails around it.
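The article does not spell out how those two guards are built, so what follows is one possible shape rather than the actual code. `RepetitionGuard`, `RunawayLoop`, and `unopened_citations` are hypothetical names; the guard is fed the model's streamed output and raises once the same line repeats too many times in a row, and the citation check compares URLs in the answer against the set of pages the agent actually opened.

```python
import re
from typing import List, Optional, Set


class RunawayLoop(Exception):
    """Raised when the model keeps emitting the same line."""


class RepetitionGuard:
    def __init__(self, max_repeats: int = 20):
        self.max_repeats = max_repeats
        self._partial = ""                 # text after the last newline seen
        self._last: Optional[str] = None   # previous non-empty line
        self._run = 0                      # consecutive repeats of that line

    def feed(self, chunk: str) -> None:
        """Pass each streamed chunk here; raises RunawayLoop to abort generation."""
        *lines, self._partial = (self._partial + chunk).split("\n")
        for line in (raw.strip() for raw in lines):
            if not line:
                continue
            self._run = self._run + 1 if line == self._last else 1
            self._last = line
            if self._run > self.max_repeats:
                raise RunawayLoop(f"line repeated {self._run} times: {line[:60]!r}")


URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


def unopened_citations(answer: str, opened_urls: Set[str]) -> List[str]:
    """URLs cited in the answer that the agent never actually opened."""
    cited = [url.rstrip(".,;:") for url in URL_RE.findall(answer)]
    return [url for url in cited if url not in opened_urls]
```

Both checks are deterministic and sit outside the model, so they leave sampling untouched and cannot corrupt URLs, version numbers, or dates the way a repetition penalty can.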

The takeaway

You do not need a data centre or a cloud subscription to run a genuinely useful AI agent. You need a used GPU, a quantization-aware model, a sensible serving stack, and the discipline to build a few safety nets.

The strategy that follows from all this is simple: own the boring 80% — the everyday searching, reading, summarising, and verifying — on hardware you control, and rent the frontier 20% only when a task genuinely demands it. For a European business, that is not just cheaper. It is a privacy and compliance posture you can defend.

If a seven-year-old gaming card can do this, the question worth asking is not "can we run AI locally?" It is "what exactly are we renting the cloud for?"

Frequently Asked Questions

How much VRAM do you really need to run a useful local LLM?

Far less than the '24GB minimum' you usually hear. I run a 12-billion-parameter model on an 11GB card. The trick is quantization-aware training (QAT) at 4-bit, which keeps near-full-precision quality while the weights occupy about 10GB. The number on the GPU box is not the real constraint — quantization, KV-cache settings, and an optional small draft model decide what actually fits.

Do I need CUDA to run AI on an NVIDIA GPU?

Not necessarily. On an older Turing card I benchmarked CUDA against the Vulkan backend in llama.cpp. Token generation — the speed you feel in a chat or agent turn — barely changed, because generation is limited by memory bandwidth, not compute. CUDA mainly helps long-context prompt processing (about 1.5x there) and carries a real maintenance cost. For chat and agent work, the default Vulkan path was fine.

Is a local model as good as a frontier cloud model?

No, and that is the wrong question. A small local model is excellent at the boring 80% of agent work — searching, reading, summarising, verifying — and weaker at hard multi-step synthesis, where small models can fabricate. The right posture is to own the everyday workload locally and call a frontier model only for the 20% that genuinely needs it.
