
Running Local AI in Your Homelab: GPU Setup for Private LLMs

AI · 2026-03-04 · 5 min read · Tags: local-ai, ollama, llm, gpu, homelab-ai, llama
By HomeLab Starter Editorial Team. Home lab enthusiasts covering hardware setup, networking, and self-hosted services for home and small office environments.

Running AI models locally has become practical. A modern consumer GPU can run 7-billion-parameter models at usable speeds, and 70-billion-parameter models at slower but still workable speeds with enough VRAM. The appeal: complete privacy (no data leaves your network), no API costs, no rate limits, and the ability to run fine-tuned or uncensored models unavailable from cloud providers.


This guide covers setting up local AI inference in your homelab — hardware requirements, software stacks, and practical deployment patterns.


Hardware Requirements

VRAM: The Primary Bottleneck

Local LLM performance is primarily constrained by GPU VRAM. The model must fit in VRAM; any layers that spill to system RAM run at CPU speed, which is typically 10-50x slower.

VRAM requirements by model size:

Model Size   Quantization     VRAM Required
7B           Q4_K_M (4-bit)   ~5-6 GB
7B           Q8_0 (8-bit)     ~8 GB
13B          Q4_K_M           ~9-10 GB
13B          Q8_0             ~14 GB
34B          Q4_K_M           ~20 GB
70B          Q4_K_M           ~40 GB
70B          Q8_0             ~75 GB

Quantization trades quality for memory efficiency. Q4_K_M is the practical sweet spot: quality loss is small, and the model takes a bit over half the memory of Q8_0 and about 30% of full 16-bit precision.
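The figures in the table can be approximated with simple arithmetic: parameter count times bits per weight, plus some headroom. The sketch below is a rough estimator, not a measurement. The bits-per-weight values are rounded, commonly cited figures for GGUF quantizations, and the 1.5 GB overhead for KV cache and runtime buffers is an assumption, so expect results near the table rather than identical to it.

```python
# Approximate bits per weight for common GGUF quantizations.
# Rounded, commonly cited figures; the exact value varies by model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Estimate VRAM in GB: weight storage plus a flat allowance for
    KV cache and runtime buffers (overhead_gb is an assumed value)."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb


for size in (7, 13, 34, 70):
    print(f"{size}B Q4_K_M: ~{estimate_vram_gb(size, 'Q4_K_M'):.1f} GB")
```

Long contexts need more headroom than the flat allowance used here, so treat the result as a lower bound when planning a purchase.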

GPU Recommendations by Budget

VRAM ≥ 8GB (runs 7B models well):

VRAM ≥ 16GB (runs 13B, handles 34B with quality tradeoffs):

VRAM ≥ 40GB (runs 70B models):

For most homelab use — code assistance, document Q&A, summarization — a 7B model on an RTX 3060 12GB is excellent. 13B models on 16GB GPUs produce noticeably better output for complex reasoning tasks.

CPU-Only Inference

Running without a GPU is possible via llama.cpp with CPU optimizations (AVX2, AVX-512). A modern 8-core CPU can run:

CPU inference is practical for offline summarization of long documents, batch processing, or testing — not for interactive chat.

Software Stack Options

Ollama is the easiest path to local LLM inference. It handles model downloads, GPU detection, and provides a simple API.

Installation on Linux:

curl -fsSL https://ollama.com/install.sh | sh

Pull and run a model:

# Pull Llama 3.1 8B (good general-purpose model):
ollama pull llama3.1:8b

# Interactive chat:
ollama run llama3.1:8b

# List downloaded models:
ollama list

API usage (compatible with OpenAI API format):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Explain VLAN segmentation briefly"}]
  }'
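The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_chat_request` and `chat` are made up for illustration, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "llama3.1:8b",
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST request for Ollama's OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the assistant's reply text.
    Needs a running Ollama server with the model already pulled."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Explain VLAN segmentation briefly"))` should print the model's answer.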

Ollama automatically detects NVIDIA GPUs (via CUDA) and AMD GPUs (via ROCm on Linux). On macOS, it uses Metal for M-series GPU acceleration.

Docker Deployment (NVIDIA)

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama-data:
  open-webui-data:

Open WebUI provides a ChatGPT-like interface for interacting with your local models.
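Once the Compose stack is up, a quick way to confirm that the Ollama container is reachable is to ask it which models it has downloaded. The sketch below calls Ollama's `/api/tags` endpoint on the published port; the function names are illustrative, and the host and port assume the port mapping shown in the Compose file.

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))
```

An empty list from `list_models()` means the server is up but no model has been pulled yet; a connection error points at the container or the port mapping.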

Prerequisite: NVIDIA Container Toolkit must be installed:

# Ubuntu/Debian:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

AMD GPU Setup (ROCm)

AMD GPUs require ROCm for GPU acceleration. Support varies by model:

Supported (good ROCm support): RX 6000/7000 series, Instinct MI series
Limited support: RX 5000 series (may need HSA_OVERRIDE_GFX_VERSION)

# Install ROCm (Ubuntu 22.04; assumes AMD's ROCm apt repository is already configured):
sudo apt install -y rocm-dev rocm-libs rocminfo

# Verify GPU detection:
rocminfo | grep "gfx"

# Ollama auto-detects ROCm:
ollama run llama3.1:8b

For AMD GPUs not officially supported, set the HSA override:

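# Example for RX 5700 XT (gfx1010). Set the override on the server process;
# setting it only on `ollama run` does not reach an already-running server:
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve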

llama.cpp (Advanced)

llama.cpp is the underlying C++ inference engine that Ollama uses. Running it directly gives more control:

# Clone and build with CUDA support:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Download a GGUF model (Llama 3.1 8B Q4):
# From huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Run inference:
./build/bin/llama-cli \
  -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  --gpu-layers 99 \
  -p "What is a VLAN?"

--gpu-layers 99 offloads all layers to the GPU. Reduce this number to split the model between GPU VRAM and system RAM if it does not fit entirely in VRAM.
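A starting value for --gpu-layers can be estimated by dividing the usable VRAM by the approximate size of one layer. The sketch below is a back-of-the-envelope helper with a hypothetical name. It assumes all layers are about the same size and holds back 1.5 GB for KV cache and buffers, and the example numbers (a roughly 4.9 GB file with 32 layers) are approximations for an 8B Q4_K_M model.

```python
import math


def layers_that_fit(model_file_gb: float, n_layers: int,
                    free_vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming every layer is roughly
    the same size and reserve_gb is kept back for KV cache and buffers."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Example: a ~4.9 GB Q4_K_M file with 32 layers on a GPU with 4 GB free.
print(layers_that_fit(4.9, 32, 4.0))
```

Treat the result as a first guess: if llama.cpp reports an out-of-memory error, lower the value by a few layers and try again.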

Model Selection

Code assistance:

General chat/reasoning:

Embeddings (for RAG applications):

Pull via Ollama: ollama pull modelname:tag


Practical Homelab Use Cases

1. Code Assistant (Continue.dev + Ollama)

Continue.dev is a VS Code extension that connects to local Ollama models for code completion and chat:

// ~/.continue/config.json
{
  "models": [{
    "title": "Llama 3.1 8B",
    "provider": "ollama",
    "model": "llama3.1:8b"
  }],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder",
    "provider": "ollama",
    "model": "deepseek-coder-v2:16b"
  }
}

2. Document Q&A (RAG with Ollama + Chroma)

Use a retrieval-augmented generation (RAG) pipeline to query your homelab documentation:

Open WebUI natively supports RAG — upload PDFs, text files, or entire document libraries, then query them with natural language.
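For readers who want to see what the retrieval step of a RAG pipeline does, here is a minimal in-memory sketch. It swaps Chroma for a plain list and leaves the embedding function as a parameter, so nothing here is Chroma's or Open WebUI's actual API; `retrieve` and `build_prompt` are hypothetical helper names. In practice `embed` would call an embedding model served by Ollama.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str],
             embed: Callable[[str], Vector], top_k: int = 2) -> list[str]:
    """Return the top_k chunks whose embeddings are closest to the query's."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Combine retrieved chunks and the question into a single prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"
```

A vector database does the same ranking but stores the embeddings once instead of recomputing them on every query, which matters as soon as the document set grows beyond a few dozen chunks.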

3. Automation with LangChain

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1:8b", base_url="http://your-server:11434")

response = llm.invoke("Summarize the key steps to configure WireGuard on OpenWrt")
print(response)

Performance Optimization

Context length tradeoffs: Longer context = more VRAM. For interactive chat, 4K context is sufficient. Only increase for long-document tasks.
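The VRAM cost of context comes mostly from the key/value cache, which grows linearly with the number of tokens. The sketch below works through the arithmetic for one sequence. The default shape (32 layers, 8 KV heads, head dimension 128) is the published Llama 3.1 8B configuration, and the 2 bytes per value assumes an unquantized 16-bit cache; other models and cache settings will differ.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence.
    The factor of 2 covers keys and values; defaults are the published
    Llama 3.1 8B shape with a 16-bit cache."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


for ctx in (4096, 32768, 131072):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 1024**3:.1f} GiB")
```

Under these assumptions a 4K context costs about half a gibibyte, while the model's full 128K context would need more VRAM for the cache than for the quantized weights themselves.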

Parallel requests: Ollama handles concurrent requests but each consumes additional VRAM. On 12GB, running two simultaneous 7B sessions may cause VRAM overflow.

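Flash Attention: Ollama supports Flash Attention, which reduces VRAM use for long contexts. Depending on the Ollama version and model it may not be on by default; setting OLLAMA_FLASH_ATTENTION=1 in the server's environment enables it.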

Quantization selection:

Power Consumption

GPU inference draws significant power:

GPU               Inference Power   Idle Power
RTX 3060 12GB     130-170W          10-15W
RTX 3090          250-320W          15-20W
RTX 4090          350-450W          15-25W
AMD RX 7900 XTX   250-310W          15-20W

For a homelab server running inference occasionally, the power costs are manageable. For 24/7 inference, an always-on M-series Mac mini (20-40W total) may be more economical than an NVIDIA GPU workstation.
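To put numbers on "manageable", the running cost can be estimated from the table above. This is a small sketch with a hypothetical helper name. It covers the GPU only, not the rest of the server, and the electricity price of 0.15 per kWh is an assumption to replace with your own tariff.

```python
def monthly_cost(active_watts: float, idle_watts: float, active_hours_per_day: float,
                 price_per_kwh: float, days: int = 30) -> float:
    """Electricity cost for a GPU that is active part of the day and idle the rest."""
    idle_hours = 24 - active_hours_per_day
    daily_kwh = (active_watts * active_hours_per_day + idle_watts * idle_hours) / 1000
    return daily_kwh * days * price_per_kwh


# RTX 3090 midpoints from the table (285 W active, 17.5 W idle),
# 2 hours of inference per day, and an assumed price of 0.15 per kWh.
print(f"{monthly_cost(285, 17.5, 2, 0.15):.2f}")
```

Setting `active_hours_per_day` to 24 shows the always-on case, which is where a low-power machine starts to pay off.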

Getting Started

The fastest path:

  1. Check if your GPU is NVIDIA or AMD and its VRAM
  2. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  3. Pull a model: ollama pull llama3.1:8b
  4. Test it: ollama run llama3.1:8b
  5. Deploy Open WebUI via Docker for a proper interface
  6. Explore model variants based on your VRAM
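Step 1 can be scripted on NVIDIA systems by querying nvidia-smi for each GPU's name and total memory. The sketch below covers NVIDIA only (on AMD, rocminfo plays a similar role), and the function names are illustrative. It returns an empty list when nvidia-smi is not installed and raises an error if the tool is present but fails, for example when the driver is not loaded.

```python
import shutil
import subprocess


def parse_nvidia_smi(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory.total' CSV rows into (name, MiB) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, int(mem)))
    return gpus


def detect_nvidia_gpus() -> list[tuple[str, int]]:
    """Return detected NVIDIA GPUs, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_nvidia_smi(out)
```

The memory figure is reported in MiB, so a 12 GB card shows up as roughly 12288.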

Local AI inference has crossed the threshold from interesting experiment to practical daily tool for many homelab operators. A used RTX 3090 and an afternoon of setup gets you a private, capable AI assistant with no recurring API costs and no data leaving your network.
