I’ve been running local LLMs on my homelab for a few months now. The setup is surprisingly practical — fast enough for real use, completely private, zero API costs. Here’s how I did it.
## Why local LLMs?
- Privacy — no prompts sent to OpenAI/Anthropic. Useful when working with internal configs, scripts, or customer data.
- Cost — no per-token billing. Run as many queries as you want.
- Availability — works offline, not subject to API rate limits or outages.
- Customization — you can fine-tune models on your own data.
The tradeoff: you need decent hardware and a bit of setup time.
## Hardware I’m using
- GPU: NVIDIA RTX 3060 12GB (passthrough to Ubuntu VM)
- Host: Intel N305 mini PC, 32GB RAM
- ESXi 8.0 U2 with GPU passthrough configured
A GPU is not strictly required — Ollama runs on CPU too — but performance is dramatically better with one. On CPU only, LLaMA 3 8B takes ~10 seconds per token. With the 3060, it’s real-time.
## Step 1 — GPU passthrough on ESXi 8
In vSphere Client:
- Navigate to Host → Configure → Hardware → PCI Devices
- Find your GPU, click Toggle Passthrough, reboot the host
- On the VM: Edit Settings → Add Other Device → PCI Device → select your GPU
- Set VM memory reservation to 100% (required for passthrough)
Important: passthrough disables the GPU for the ESXi console. Use IPMI/iDRAC or a second display adapter for host management.
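Before installing drivers, it’s worth confirming the VM actually sees the card. A quick sanity check (guarded in case pciutils isn’t installed):

```shell
# Inside the Ubuntu VM: confirm the passed-through GPU shows up on the
# PCI bus before installing drivers.
if command -v lspci >/dev/null 2>&1; then
  lspci -nn | grep -i nvidia || echo "GPU not visible - check passthrough config"
else
  echo "lspci not installed (apt install pciutils)"
fi
```

If the GPU doesn’t appear here, no amount of driver installation will help — go back and re-check the passthrough and memory reservation settings.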
## Step 2 — Ubuntu VM setup
I use Ubuntu 24.04 Server. After install:
```shell
# Install NVIDIA drivers
sudo apt install nvidia-driver-550 -y

# Verify GPU is visible
nvidia-smi
```

## Step 3 — Install Ollama
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (LLaMA 3 8B is a good start)
ollama pull llama3

# Run it
ollama run llama3
```

That’s it. Ollama handles model management, serving, and the API.
## Step 4 — Expose the API
Ollama serves a REST API on port 11434. I expose it inside my lab network (not to the internet):
```shell
# Edit the systemd service
sudo systemctl edit ollama
```

Add:

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then restart the service with `sudo systemctl restart ollama`. Now I can use the API from any machine on my network, including from Ansible playbooks and scripts.
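To sanity-check the exposed API, here’s a minimal sketch using curl against Ollama’s `/api/generate` endpoint. The IP is a placeholder for the VM’s lab address, and `jq` is assumed for extracting the response field:

```shell
# Placeholder: replace with your Ollama VM's lab IP.
OLLAMA_URL="http://192.168.1.50:11434"

# Request body; "stream": false returns a single JSON object
# instead of a stream of chunks.
payload='{"model": "llama3", "prompt": "Explain VLANs in one sentence.", "stream": false}'

# Only send the request if the server is reachable (skips cleanly otherwise).
if curl -fsS --max-time 2 "$OLLAMA_URL/api/version" >/dev/null 2>&1; then
  curl -fsS "$OLLAMA_URL/api/generate" -d "$payload" | jq -r '.response'
else
  echo "Ollama not reachable at $OLLAMA_URL"
fi
```

The same POST works from Ansible’s `uri` module or any HTTP client, which is what makes the setup scriptable.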
## Practical uses I’ve found
- Summarizing logs: paste a 500-line vCenter log → ask “what’s wrong here?” → works surprisingly well.
- Writing Ansible tasks: “Write an Ansible task to configure NTP on RHEL 8” → usually correct on first try.
- Explaining configs: paste an NSX firewall ruleset → ask “explain what this allows” → great for audits.
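The log-summarizing workflow can be scripted too, since `ollama run` reads its prompt from stdin when run non-interactively. A sketch — the log path is a placeholder, and everything is guarded so it degrades gracefully:

```shell
# Placeholder path: point at whatever log you want summarized.
LOG="/var/log/vmware/hostd.log"

# Prepend a question, then pipe the last 500 lines into the model.
if command -v ollama >/dev/null 2>&1 && [ -f "$LOG" ]; then
  { echo "What's wrong in this log? Summarize the errors:"; tail -n 500 "$LOG"; } \
    | ollama run llama3
else
  echo "ollama or $LOG not available - skipping"
fi
```

Dropping a wrapper like this into a shell function makes “ask the model about this log” a one-liner during troubleshooting.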
## Model recommendations
| Model | Size | VRAM needed | Good for |
|---|---|---|---|
| llama3:8b | 4.7 GB | 6 GB | General tasks, fast |
| llama3:70b | 40 GB | 48 GB+ | Complex reasoning (needs big GPU) |
| mistral:7b | 4.1 GB | 6 GB | Code generation |
| codellama:13b | 7.4 GB | 10 GB | Code only, better than base llama |
| phi3:mini | 2.2 GB | 3 GB | Lightweight, runs on CPU |
For a 12GB GPU, llama3:8b or codellama:13b are the sweet spots.
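Before pulling a larger model, it helps to check how much VRAM is actually free — a guarded sketch using nvidia-smi’s query flags:

```shell
# Report total and used GPU memory; falls back gracefully on CPU-only boxes.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total,memory.used --format=csv
else
  echo "nvidia-smi not found - running CPU-only"
fi
```

If the model’s size from the table above doesn’t fit in the free VRAM, Ollama will offload layers to system RAM and token speed drops sharply.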
The whole setup took me about 2 hours. GPU passthrough on ESXi is the fiddly part — everything after that is straightforward.