Running large language models locally gives you privacy, no API costs, and full control over inference. This guide covers everything from hardware requirements to production deployment using llama.cpp.
Why Local LLMs?
- Privacy: your data never leaves your machine
- No API costs: run unlimited inference for the cost of electricity
- Full control: customize quantization, context length, and sampling
- Offline capability: works without an internet connection
- Fine-tuning: adapt models to your specific domain
Hardware Requirements
| Model Size | Min VRAM (Q4) | Recommended VRAM | RAM | Example Models |
|---|---|---|---|---|
| 7-8B | 6 GB | 8 GB | 16 GB | Llama 3.1 8B, Qwen 2.5 7B |
| 13-14B | 10 GB | 16 GB | 32 GB | Llama 2 13B, Qwen 2.5 14B |
| 32-34B | 20 GB | 24 GB | 64 GB | Qwen 2.5 32B, Code Llama 34B |
| 70B+ | 40 GB | 48+ GB | 128 GB | Llama 3.1 70B, Qwen 2.5 72B |
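
The VRAM figures above can be sanity-checked with simple arithmetic: the memory taken by the weights is roughly parameters × bits per weight / 8. The sketch below is only an approximation under stated assumptions. The function name `estimate_weights_gb` is hypothetical, and the 4.8 bits-per-weight figure is an illustrative value for a 4-bit K-quant, not a measured one. KV cache and runtime buffers come on top of the result.

```shell
#!/bin/sh
# Rough estimate of the memory needed for the model weights alone.
# Usage: estimate_weights_gb <params_in_billions> <bits_per_weight>
# KV cache and runtime buffers are NOT included in the result.
estimate_weights_gb() {
    # billions of params * bits per weight / 8 bits per byte = decimal GB
    LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 8 4.8    # prints 4.8  (8B model, assumed ~4.8 bits per weight)
estimate_weights_gb 70 4.8   # prints 42.0 (70B model, same assumption)
```

Leaving a few gigabytes of headroom above the estimate is a reasonable starting point, since longer contexts increase the KV cache.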
Building llama.cpp
llama.cpp is the reference inference engine for GGUF models and can run them on the CPU, the GPU, or a mix of both. Here's how to build it with CUDA support:
# Clone and build with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89 # 89 = Ada (RTX 40-series, e.g. RTX 4090); set this to your GPU's compute capability
cmake --build . --config Release -j
# The server binary exposes an OpenAI-compatible API:
./bin/llama-server -m model.gguf --port 8080 --n-gpu-layers 99
Choosing the Right Quantization
Quantization reduces model size with minimal quality loss. The sweet spot depends on your VRAM:
Tip: Q5_K_M is the sweet spot. Its output is nearly indistinguishable from FP16 while the weights take roughly a third of the memory. Always prefer K-quants over legacy quants (such as Q4_0) at the same bit width.
| Quant | Bits | Size (70B) | Quality Loss | Recommended For |
|---|---|---|---|---|
| Q8_0 | 8-bit | ~70 GB | Negligible | Best quality, lots of VRAM |
| Q6_K | 6-bit | ~55 GB | Minimal | High quality, good VRAM |
| Q5_K_M | 5-bit | ~48 GB | Very small | Best balance (recommended) |
| Q4_K_M | 4-bit | ~40 GB | Small | Limited VRAM |
| Q3_K_M | 3-bit | ~32 GB | Noticeable | Very limited VRAM |
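
The table can be read as a simple lookup: take the highest-quality quant whose file still fits your memory budget. The following is a minimal sketch of that rule, using the table's approximate 70B sizes. The function name `pick_quant_70b` is hypothetical, the budget must be a whole number of gigabytes, and the comparison ignores the extra memory needed for the KV cache.

```shell
#!/bin/sh
# Pick the highest-quality quant from the table whose 70B file size
# fits a memory budget. Sizes are the table's approximate figures in GB.
# Usage: pick_quant_70b <budget_gb>   (integer GB)
pick_quant_70b() {
    budget="$1"
    # Entries are ordered from best quality to smallest size.
    for entry in Q8_0:70 Q6_K:55 Q5_K_M:48 Q4_K_M:40 Q3_K_M:32; do
        name="${entry%%:*}"   # text before the colon
        size="${entry##*:}"   # text after the colon
        if [ "$size" -le "$budget" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "none"
    return 1
}

pick_quant_70b 48   # prints Q5_K_M
pick_quant_70b 24 || echo "nothing fits: offload fewer layers or use a smaller model"
```

In practice it is safer to pass a budget somewhat below your total VRAM so the context still fits.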
Performance Optimization
- Use --n-gpu-layers 99 to offload all layers to the GPU
- Set --ctx-size to match your needs: a larger context uses more VRAM
- Use --flash-attn for faster attention computation
- Enable --mlock to prevent swapping (if you have enough RAM)
- Use --cont-batching for continuous batching in server mode
- Experiment with --threads for CPU-bound inference
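
Put together, the flags above might be combined as in the sketch below. It prints the command instead of running it, so you can review it first. The helper name `server_cmd`, the context size of 8192, and the thread count of 8 are illustrative assumptions. Flag syntax can also differ between llama.cpp versions (for example, --flash-attn may expect a value in some builds), so check `llama-server --help` for your build.

```shell
#!/bin/sh
# Assemble a llama-server command line from the flags listed above.
# The command is printed, not executed, so it can be reviewed first.
# Usage: server_cmd <model.gguf> <ctx_size> <threads>
server_cmd() {
    model="$1"; ctx="$2"; threads="$3"
    printf '%s' "./bin/llama-server -m $model --port 8080"
    printf '%s' " --n-gpu-layers 99"      # offload all layers to the GPU
    printf '%s' " --ctx-size $ctx"        # larger context uses more VRAM
    printf '%s' " --flash-attn"           # may need a value in some builds
    printf '%s' " --mlock"                # keep the model from being swapped out
    printf '%s' " --cont-batching"        # continuous batching in server mode
    printf '%s\n' " --threads $threads"   # CPU threads for CPU-bound work
}

# Example values are assumptions, not recommendations.
server_cmd model.gguf 8192 8
```

Changing one flag at a time and comparing tokens per second makes it easier to see which setting actually helps on your hardware.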
OpenAI-Compatible API
llama.cpp's server mode provides an OpenAI-compatible API, so you can use it with any client that supports the OpenAI API:
# Send a chat completion request (the server must already be running on port 8080)
curl http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7
}'
Multi-Model Setup
For a production setup, run multiple models on different ports and use a router to direct requests:
- Small model (7B) on port 8081 for fast, simple tasks
- Medium model (32B) on port 8082 for balanced tasks
- Large model (70B) on port 8083 for complex reasoning
- Use a reverse proxy to route based on task complexity
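
The routing rule itself is just a mapping from task tier to port. The sketch below shows one possible shape for it, assuming the port layout above. The tier names (simple, balanced, complex) and the function names `pick_port` and `endpoint_for` are hypothetical, and deciding which tier a request belongs to is left out.

```shell
#!/bin/sh
# Map a task tier to the port of the model that should serve it.
# Ports follow the layout above; the tier names are hypothetical.
# Usage: pick_port <simple|balanced|complex>
pick_port() {
    case "$1" in
        simple)   echo 8081 ;;  # small 7B model
        balanced) echo 8082 ;;  # medium 32B model
        complex)  echo 8083 ;;  # large 70B model
        *)        echo "unknown tier: $1" >&2; return 1 ;;
    esac
}

# Build the chat completions URL for a request of the given tier.
endpoint_for() {
    port="$(pick_port "$1")" || return 1
    echo "http://localhost:${port}/v1/chat/completions"
}

endpoint_for complex   # prints http://localhost:8083/v1/chat/completions
```

In a production setup the same mapping would more likely live in the reverse proxy's configuration, with a script like this serving only as a quick way to test each backend.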
Info: The KCAI Desktop app (see AI Projects section) integrates directly with llama.cpp for local inference and also supports cloud providers as a fallback.