KC // kevcspot

A practical guide to deploying local LLMs with llama.cpp, including hardware requirements, quantization, and performance optimization.

Running large language models locally gives you privacy, no API costs, and full control over inference. This guide covers everything from hardware requirements to production deployment using llama.cpp.

Why Local LLMs?

▸Privacy — your data never leaves your machine
▸No API costs — run unlimited inference for the cost of electricity
▸Full control — customize quantization, context length, and sampling
▸Offline capability — works without internet
▸Fine-tuning — adapt models to your specific domain

Hardware Requirements

Model Size	Min VRAM (Q4)	Recommended VRAM	RAM	Example Models
7-8B	6 GB	8 GB	16 GB	Llama 3.1 8B, Qwen 2.5 7B
13-14B	10 GB	16 GB	32 GB	Llama 2 13B, Qwen 2.5 14B
32-34B	20 GB	24 GB	64 GB	Qwen 2.5 32B, Code Llama 34B
70B+	40 GB	48+ GB	128 GB	Llama 3.1 70B, Qwen 2.5 72B
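
The "Min VRAM (Q4)" column can be turned into a quick lookup. This is only a sketch: `tier_for_vram` is a hypothetical helper name, and the thresholds are just the table's minimums in whole GB, so treat the result as a starting point rather than a guarantee.

```shell
# Map available VRAM (whole GB) to the largest model tier in the table above.
# Thresholds are the table's "Min VRAM (Q4)" values; tier_for_vram is a
# hypothetical helper, not part of llama.cpp.
tier_for_vram() {
  vram_gb=$1
  if [ "$vram_gb" -ge 40 ]; then echo "70B+"
  elif [ "$vram_gb" -ge 20 ]; then echo "32-34B"
  elif [ "$vram_gb" -ge 10 ]; then echo "13-14B"
  elif [ "$vram_gb" -ge 6 ]; then echo "7-8B"
  else echo "none"
  fi
}
```

For example, `tier_for_vram 24` prints `32-34B`. On an NVIDIA machine, total VRAM in MiB can be read with `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`; divide by 1024 before passing it in.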

Building llama.cpp

llama.cpp is an efficient CPU/GPU inference engine for GGUF models. Here's how to build it with CUDA support:

# Clone and build with CUDA support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89  # 89 = RTX 4090
cmake --build . --config Release -j

# The server binary exposes an OpenAI-compatible API:
./bin/llama-server -m model.gguf --port 8080 --n-gpu-layers 99
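
The `89` in the build command is the GPU's compute capability (8.9) with the dot removed. If you're on a different card, a small helper can derive the value instead of hard-coding it. The sketch below assumes an NVIDIA driver recent enough to support the `compute_cap` query in `nvidia-smi`; `cuda_arch_from_cap` is a hypothetical name.

```shell
# Convert a CUDA compute capability such as "8.9" into the value expected by
# CMAKE_CUDA_ARCHITECTURES ("89"). cuda_arch_from_cap is a hypothetical helper.
cuda_arch_from_cap() {
  printf '%s\n' "$1" | tr -d '.'
}

# Example (assumes nvidia-smi supports the compute_cap query field):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(cuda_arch_from_cap "$cap")"
```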

Choosing the Right Quantization

Quantization reduces model size with minimal quality loss. The sweet spot depends on your VRAM:

✦ Tip: Q5_K_M is the sweet spot: nearly indistinguishable from FP16 quality at roughly one third of the FP16 memory footprint (about 48 GB versus about 140 GB for a 70B model). Always prefer K-quants over legacy quants.

Quant	Bits	Size (70B)	Quality Loss	Recommended For
Q8_0	8-bit	~70 GB	Negligible	Best quality, lots of VRAM
Q6_K	6-bit	~55 GB	Minimal	High quality, good VRAM
Q5_K_M	5-bit	~48 GB	Very small	Best balance (recommended)
Q4_K_M	4-bit	~40 GB	Small	Limited VRAM
Q3_K_M	3-bit	~32 GB	Noticeable	Very limited VRAM
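
A back-of-the-envelope check for the sizes above: weight size in GB is roughly parameters (in billions) times bits per weight, divided by 8. The helper below is a sketch with a hypothetical name (`estimate_weights_gb`). Using the nominal bit width gives a lower bound, since K-quants also store scaling data, which is why the table's figures sit a little above the nominal result. The KV cache and runtime buffers come on top.

```shell
# Rough weight size in GB: parameters (billions) x bits per weight / 8.
# Ignores the KV cache and runtime buffers, so real VRAM use is higher.
# estimate_weights_gb is a hypothetical helper, not part of llama.cpp.
estimate_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}
```

For example, `estimate_weights_gb 70 16` prints `140.0` (FP16), and `estimate_weights_gb 70 4` prints `35.0`, a floor for the ~40 GB listed for Q4_K_M.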

Performance Optimization

▸Use --n-gpu-layers 99 to offload all layers to GPU
▸Set --ctx-size to match your needs — larger context = more VRAM
▸Use --flash-attn for faster attention computation
▸Enable --mlock to prevent swapping (if you have enough RAM)
▸Use --cont-batching for continuous batching in server mode
▸Experiment with --threads for CPU-bound inference
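
Put together, the flags above might be combined like this. It's a sketch, not a tuned configuration: `llama_server_cmd` is a hypothetical helper that only prints the command, and the default context size (8192) and thread count (8) are placeholders. Flag syntax has changed between llama.cpp releases, so check `./bin/llama-server --help` on your build before relying on it.

```shell
# Print a llama-server command line that combines the flags listed above.
# llama_server_cmd is a hypothetical helper; the defaults for context size
# and threads are placeholder assumptions to tune for your hardware.
llama_server_cmd() {
  model=$1
  ctx=${2:-8192}
  threads=${3:-8}
  printf '%s\n' "./bin/llama-server -m $model --port 8080 --n-gpu-layers 99 --ctx-size $ctx --flash-attn --mlock --cont-batching --threads $threads"
}
```

Running `llama_server_cmd model.gguf 4096 6` prints the full command with a 4096-token context and 6 threads, which you can review before running it.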

OpenAI-Compatible API

llama.cpp's server mode provides an OpenAI-compatible API, so you can use it with any client that supports the OpenAI API:

# Query the running server
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
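
Large models can take a while to load, and requests sent too early will fail. One way to handle that in scripts is to poll the server's `/health` endpoint before sending the first request. A minimal sketch, assuming the server from earlier is listening on localhost:8080; `wait_for_server` is a hypothetical helper name.

```shell
# Poll the server's /health endpoint until it answers with HTTP 200 or the
# attempts run out. wait_for_server is a hypothetical helper name.
wait_for_server() {
  base_url=$1
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (e.g. while the model loads)
    if curl -sf --max-time 2 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

Usage: `wait_for_server http://localhost:8080 60 && echo "ready"` waits up to about a minute before giving up.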

Multi-Model Setup

For a production setup, run multiple models on different ports and use a router to direct requests:

▸Small model (7B) on port 8081 for fast, simple tasks
▸Medium model (32B) on port 8082 for balanced tasks
▸Large model (70B) on port 8083 for complex reasoning
▸Use a reverse proxy to route based on task complexity
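
The routing idea can be sketched as a simple lookup. The tier names (`simple`, `balanced`, `complex`) and the `backend_for` helper are hypothetical; how you classify a task's complexity is up to you, and a real deployment would usually put this logic in the reverse proxy or a small routing service.

```shell
# Map a task tier to the backend from the list above (ports 8081-8083).
# backend_for and the tier names are hypothetical, for illustration only.
backend_for() {
  case "$1" in
    simple)   echo "http://localhost:8081" ;;
    balanced) echo "http://localhost:8082" ;;
    complex)  echo "http://localhost:8083" ;;
    *)        echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Example: send a request to the backend chosen for a "complex" task.
#   curl "$(backend_for complex)/v1/chat/completions" \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "local", "messages": [{"role": "user", "content": "Hello!"}]}'
```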

ℹ Info: The KCAI Desktop app (see AI Projects section) integrates directly with llama.cpp for local inference while also supporting cloud providers as a fallback.