I’ve been running local LLMs on my homelab for a few months now. The setup is surprisingly practical — fast enough for real use, completely private, zero API costs. Here’s how I did it.
## Why local LLMs?
- Privacy — no prompts sent to OpenAI/Anthropic. Useful when working with internal configs, scripts, or customer data.
- Cost — no per-token billing. Run as many queries as you want.
- Availability — works offline, not subject to API rate limits or outages.
- Customization — you can fine-tune models on your own data.
The tradeoff: you need decent hardware and a bit of setup time.
## Hardware I’m using
- GPU: NVIDIA RTX 3060 12GB (passthrough to Ubuntu VM)
- Host: Intel N305 mini PC, 32GB RAM
- ESXi 8.0 U2 with GPU passthrough configured
A GPU is not strictly required — Ollama runs on CPU too — but performance is dramatically better with one. On CPU only, LLaMA 3 8B takes ~10 seconds per token. With the 3060, it’s real-time.
## Step 1 — GPU passthrough on ESXi 8
In vSphere Client:
- Navigate to Host → Configure → Hardware → PCI Devices
- Find your GPU, click Toggle Passthrough, reboot the host
- On the VM: Edit Settings → Add Other Device → PCI Device → select your GPU
- Set VM memory reservation to 100% (required for passthrough)
Important: passthrough disables the GPU for the ESXi console. Use IPMI/iDRAC or a second display adapter for host management.
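Before installing drivers, it’s worth confirming the VM actually sees the card. A quick sanity check (guarded in case pciutils isn’t installed):

```shell
# Inside the Ubuntu VM: confirm the passed-through GPU shows up on the
# PCI bus before installing drivers.
if command -v lspci >/dev/null 2>&1; then
  lspci -nn | grep -i nvidia || echo "GPU not visible - check passthrough config"
else
  echo "lspci not installed (apt install pciutils)"
fi
```

If the GPU doesn’t appear here, no amount of driver installation will help — go back and re-check the passthrough and memory reservation settings.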
## Step 2 — Ubuntu VM setup
I use Ubuntu 24.04 Server. After install:
```shell
# Install NVIDIA drivers
sudo apt install nvidia-driver-550 -y

# Verify GPU is visible
nvidia-smi
```

## Step 3 — Install Ollama
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (LLaMA 3 8B is a good start)
ollama pull llama3

# Run it
ollama run llama3
```

That’s it. Ollama handles model management, serving, and the API.
## Step 4 — Expose the API
Ollama serves a REST API on port 11434. I expose it inside my lab network (not to the internet):
```shell
# Edit the systemd service
sudo systemctl edit ollama
```

Add:

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then restart the service with `sudo systemctl restart ollama`. Now I can use the API from any machine on my network, including from Ansible playbooks and scripts.
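To sanity-check the exposed API, here’s a minimal sketch using curl against Ollama’s `/api/generate` endpoint. The IP is a placeholder for the VM’s lab address, and `jq` is assumed for extracting the response field:

```shell
# Placeholder: replace with your Ollama VM's lab IP.
OLLAMA_URL="http://192.168.1.50:11434"

# Request body; "stream": false returns a single JSON object
# instead of a stream of chunks.
payload='{"model": "llama3", "prompt": "Explain VLANs in one sentence.", "stream": false}'

# Only send the request if the server is reachable (skips cleanly otherwise).
if curl -fsS --max-time 2 "$OLLAMA_URL/api/version" >/dev/null 2>&1; then
  curl -fsS "$OLLAMA_URL/api/generate" -d "$payload" | jq -r '.response'
else
  echo "Ollama not reachable at $OLLAMA_URL"
fi
```

The same POST works from Ansible’s `uri` module or any HTTP client, which is what makes the setup scriptable.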
## Practical uses I’ve found
- Summarizing logs: paste a 500-line vCenter log → ask “what’s wrong here?” → works surprisingly well.
- Writing Ansible tasks: “Write an Ansible task to configure NTP on RHEL 8” → usually correct on first try.
- Explaining configs: paste an NSX firewall ruleset → ask “explain what this allows” → great for audits.
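The log-summarizing workflow can be scripted too, since `ollama run` reads its prompt from stdin when run non-interactively. A sketch — the log path is a placeholder, and everything is guarded so it degrades gracefully:

```shell
# Placeholder path: point at whatever log you want summarized.
LOG="/var/log/vmware/hostd.log"

# Prepend a question, then pipe the last 500 lines into the model.
if command -v ollama >/dev/null 2>&1 && [ -f "$LOG" ]; then
  { echo "What's wrong in this log? Summarize the errors:"; tail -n 500 "$LOG"; } \
    | ollama run llama3
else
  echo "ollama or $LOG not available - skipping"
fi
```

Dropping a wrapper like this into a shell function makes “ask the model about this log” a one-liner during troubleshooting.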
## Model recommendations
| Model | Size | VRAM needed | Good for |
|---|---|---|---|
| llama3:8b | 4.7 GB | 6 GB | General tasks, fast |
| llama3:70b | 40 GB | 48 GB+ | Complex reasoning (needs big GPU) |
| mistral:7b | 4.1 GB | 6 GB | Code generation |
| codellama:13b | 7.4 GB | 10 GB | Code only, better than base llama |
| phi3:mini | 2.2 GB | 3 GB | Lightweight, runs on CPU |
For a 12GB GPU, llama3:8b or codellama:13b are the sweet spots.
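Before pulling a larger model, it helps to check how much VRAM is actually free — a guarded sketch using nvidia-smi’s query flags:

```shell
# Report total and used GPU memory; falls back gracefully on CPU-only boxes.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total,memory.used --format=csv
else
  echo "nvidia-smi not found - running CPU-only"
fi
```

If the model’s size from the table above doesn’t fit in the free VRAM, Ollama will offload layers to system RAM and token speed drops sharply.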
The whole setup took me about 2 hours. GPU passthrough on ESXi is the fiddly part — everything after that is straightforward.