Tech & Random Stuff

Local LLM Deployment: Running Frontier Models on Your Own Hardware

A practical guide to deploying local LLMs with llama.cpp, including hardware requirements, quantization, and performance optimization.

2026-06-10 · 15 min read

Running large language models locally gives you privacy, no API costs, and full control over inference. This guide covers everything from hardware requirements to production deployment using llama.cpp.

Why Local LLMs?

  • Privacy — your data never leaves your machine
  • No API costs — run unlimited inference for the cost of electricity
  • Full control — customize quantization, context length, and sampling
  • Offline capability — works without internet
  • Fine-tuning — adapt models to your specific domain

Hardware Requirements

Model Size  Min VRAM (Q4)  Recommended VRAM  System RAM  Example Models
7-8B        6 GB           8 GB              16 GB       Llama 3.1 8B, Qwen 2.5 7B
13-14B      10 GB          16 GB             32 GB       Llama 2 13B, Qwen 2.5 14B
32-34B      20 GB          24 GB             64 GB       Qwen 2.5 32B, Code Llama 34B
70B+        40 GB          48+ GB            128 GB      Llama 3.1 70B, Qwen 2.5 72B
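
The VRAM figures above follow from simple arithmetic: weight size is roughly the parameter count times the bits stored per weight, divided by eight. The helper below is a minimal sketch of that estimate. The function name is made up for illustration, and the result covers weights only, so the KV cache and runtime buffers come on top.

```shell
# Rough size of a model's weights in GB:
#   parameters (in billions) x bits per weight / 8
# This ignores the KV cache and runtime buffers, so treat it as a lower bound.
estimate_weights_gb() {
  # $1 = parameter count in billions, $2 = average bits per weight
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

estimate_weights_gb 8 4     # prints 4.0
estimate_weights_gb 70 4    # prints 35.0
estimate_weights_gb 70 16   # prints 140.0 (FP16)
```

K-quants store somewhat more than their nominal bit width per weight, which is one reason the table's figures sit above the plain 4-bit estimate.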

Building llama.cpp

llama.cpp is the reference inference engine for GGUF models and can run them on CPU, GPU, or a mix of both. Here's how to build it with CUDA support:

# Clone and build with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89  # 89 = RTX 4090
cmake --build . --config Release -j

# Start the server, which exposes an OpenAI-compatible API:
./bin/llama-server -m model.gguf --port 8080 --n-gpu-layers 99
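
Large models can take a while to load, so scripts that start the server usually need to wait before sending requests. The sketch below polls the server's /health endpoint until it responds. The function name, the retry count, and the WAIT_INTERVAL variable are illustrative assumptions, not part of llama.cpp.

```shell
# Poll llama-server's /health endpoint until it answers or we give up.
# $1 = base URL (default http://localhost:8080)
# $2 = maximum number of attempts (default 30)
wait_for_server() {
  url="${1:-http://localhost:8080}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors, e.g. while the model is still loading
    if curl -sf "$url/health" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep "${WAIT_INTERVAL:-1}"   # seconds between attempts (hypothetical knob)
  done
  echo "timeout"
  return 1
}

# Usage (not run here):
#   ./bin/llama-server -m model.gguf --port 8080 --n-gpu-layers 99 &
#   wait_for_server http://localhost:8080 60
```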

Choosing the Right Quantization

Quantization shrinks a model by storing its weights at lower precision, trading a little quality for a large memory saving. The right level depends on your VRAM:

✦ Tip: Q5_K_M is the sweet spot. Output quality is typically very close to FP16, while a 70B model drops from roughly 140 GB at FP16 to about 48 GB. Always prefer K-quants over legacy quants.

Quant    Bits   Size (70B)  Quality Loss  Recommended For
Q8_0     8-bit  ~70 GB      Negligible    Best quality, lots of VRAM
Q6_K     6-bit  ~55 GB      Minimal       High quality, good VRAM
Q5_K_M   5-bit  ~48 GB      Very small    Best balance (recommended)
Q4_K_M   4-bit  ~40 GB      Small         Limited VRAM
Q3_K_M   3-bit  ~32 GB      Noticeable    Very limited VRAM
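
The table can be turned into a simple selection rule: take the highest-quality quant whose file still fits after setting some memory aside for the KV cache and buffers. The sketch below does this for the 70B sizes listed above. The function name and the 4 GB default reserve are assumptions for illustration, and the right reserve depends on your context size.

```shell
# Pick the highest-quality quant from the 70B table that fits a memory budget.
# Sizes are the approximate figures from the table above.
# $1 = available memory in GB
# $2 = GB reserved for KV cache and buffers (default 4, an assumed placeholder)
pick_quant_70b() {
  awk -v avail="$1" -v reserve="${2:-4}" 'BEGIN {
    # Ordered from best quality to smallest size
    n = split("Q8_0:70 Q6_K:55 Q5_K_M:48 Q4_K_M:40 Q3_K_M:32", quants, " ")
    for (i = 1; i <= n; i++) {
      split(quants[i], q, ":")
      if (q[2] + reserve <= avail) { print q[1]; exit }
    }
    print "none"   # nothing fits; use a smaller model or offload fewer layers
  }'
}

pick_quant_70b 48    # prints Q4_K_M
pick_quant_70b 80    # prints Q8_0
pick_quant_70b 24    # prints none
```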

Performance Optimization

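  • Use --n-gpu-layers 99 to offload all layers to GPU (any value at or above the model's layer count offloads everything)
  • Set --ctx-size to match your needs; a larger context needs a bigger KV cache and therefore more VRAM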
  • Use --flash-attn for faster attention computation
  • Enable --mlock to prevent swapping (if you have enough RAM)
  • Use --cont-batching for continuous batching in server mode (recent builds enable it by default)
  • Experiment with --threads for CPU-bound inference
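
Put together, the flags above might be combined as in the sketch below, which only prints the command so you can inspect it before running. The function name and the example values are placeholders. The flag spellings are taken from the list above and can differ between llama.cpp versions, so check llama-server --help for your build.

```shell
# Print a llama-server command line that combines the flags listed above.
# $1 = model path, $2 = port, $3 = context size in tokens, $4 = CPU threads
server_cmd() {
  printf './bin/llama-server -m %s --port %s --n-gpu-layers 99 --ctx-size %s --flash-attn --mlock --cont-batching --threads %s\n' \
    "$1" "$2" "$3" "$4"
}

# Example values are placeholders; tune them for your hardware.
server_cmd model.gguf 8080 8192 8
```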

OpenAI-Compatible API

llama.cpp's server mode provides an OpenAI-compatible API, so you can use it with any client that speaks the OpenAI API and lets you set a custom base URL:

# Send a chat completion request to the running server
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
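
For scripting, the request above can be wrapped in a small helper. This is a minimal sketch: the function names and the LLAMA_URL variable are made up for illustration, and the escaping only handles backslashes and double quotes, so prompts containing newlines or control characters need a proper JSON tool.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Newlines and other control characters are NOT handled here.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the chat completion request body.
# $1 = user prompt, $2 = temperature (default 0.7, as in the example above)
chat_payload() {
  printf '{"model":"local","messages":[{"role":"user","content":"%s"}],"temperature":%s}\n' \
    "$(json_escape "$1")" "${2:-0.7}"
}

# Send the prompt to the server; LLAMA_URL is a hypothetical override variable.
ask() {
  curl -s "${LLAMA_URL:-http://localhost:8080}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "$(chat_payload "$1")"
}

# Usage (requires a running server):
#   ask "Hello!"
```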

Multi-Model Setup

For a production setup, run multiple models on different ports and use a router to direct requests:

  • Small model (7B) on port 8081 for fast, simple tasks
  • Medium model (32B) on port 8082 for balanced tasks
  • Large model (70B) on port 8083 for complex reasoning
  • Use a reverse proxy to route based on task complexity
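
The routing idea above can be sketched as a pair of lookups: classify the task, then map the tier to a port. The ports match the list above. The function names and the length-based heuristic with its thresholds are hypothetical stand-ins for whatever complexity signal your setup actually uses.

```shell
# Map a task tier to the port of the matching llama-server instance.
port_for_tier() {
  case "$1" in
    small)  echo 8081 ;;
    medium) echo 8082 ;;
    large)  echo 8083 ;;
    *)      echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Hypothetical complexity heuristic: longer prompts go to larger models.
# The 200 and 2000 character thresholds are placeholders, not tuned values.
tier_for_prompt() {
  len=${#1}
  if [ "$len" -lt 200 ]; then
    echo small
  elif [ "$len" -lt 2000 ]; then
    echo medium
  else
    echo large
  fi
}

# Resolve the backend URL that a prompt should be sent to.
backend_url() {
  echo "http://localhost:$(port_for_tier "$(tier_for_prompt "$1")")/v1/chat/completions"
}

backend_url "Hello!"    # prints http://localhost:8081/v1/chat/completions
```

In a real deployment the same mapping would live in the reverse proxy's configuration; the shell version is only meant to make the decision logic concrete.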

ℹ Info: The KCAI Desktop app (see AI Projects section) integrates directly with llama.cpp for local inference while also supporting cloud providers as fallback.
