
Running local LLMs on a homelab with Ollama

Author: Szymon Leszega
Writing about what I test and deploy myself — vSphere, homelabs, cybersecurity hardening and AI in infrastructure. No theory, only things that work.

I’ve been running local LLMs on my homelab for a few months now. The setup is surprisingly practical — fast enough for real use, completely private, zero API costs. Here’s how I did it.

Why local LLMs?

  • Privacy — no prompts sent to OpenAI/Anthropic. Useful when working with internal configs, scripts, or customer data.
  • Cost — no per-token billing. Run as many queries as you want.
  • Availability — works offline, not subject to API rate limits or outages.
  • Customization — you can fine-tune models on your own data.

The tradeoff: you need decent hardware and a bit of setup time.

Hardware I’m using

  • GPU: NVIDIA RTX 3060 12GB (passthrough to Ubuntu VM)
  • Host: Intel N305 mini PC, 32GB RAM
  • ESXi 8.0 U2 with GPU passthrough configured

A GPU is not strictly required — Ollama runs on CPU too — but performance is dramatically better with one. On CPU only, LLaMA 3 8B takes ~10 seconds per token. With the 3060, it’s real-time.

Step 1 — GPU passthrough on ESXi 8

In vSphere Client:

  1. Navigate to Host → Configure → Hardware → PCI Devices
  2. Find your GPU, click Toggle Passthrough, reboot the host
  3. On the VM: Edit Settings → Add Other Device → PCI Device → select your GPU
  4. Set VM memory reservation to 100% (required for passthrough)

Important: passthrough disables the GPU for the ESXi console. Use IPMI/iDRAC or a second display adapter for host management.
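Once the VM boots, it's worth confirming the GPU actually made it through before touching drivers. A quick sketch (assumes `pciutils` is installed, which Ubuntu Server ships by default):

```shell
# Inside the VM: check whether the passed-through GPU shows up on the PCI bus
if lspci -nn 2>/dev/null | grep -qi nvidia; then
  status="GPU visible to the guest"
else
  status="GPU not visible - re-check passthrough and the memory reservation"
fi
echo "$status"
```

If the device is missing, the memory reservation (step 4 above) is the usual culprit.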

Step 2 — Ubuntu VM setup

I use Ubuntu 24.04 Server. After install:

# Install NVIDIA drivers
sudo apt update
sudo apt install nvidia-driver-550 -y

# Reboot so the kernel module loads, then verify the GPU is visible
sudo reboot
nvidia-smi

Step 3 — Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (LLaMA 3 8B is a good start)
ollama pull llama3

# Run it
ollama run llama3

That’s it. Ollama handles model management, serving, and the API.

Step 4 — Expose the API

Ollama serves a REST API on port 11434. I expose it inside my lab network (not to the internet):

# Edit the systemd service
sudo systemctl edit ollama

# Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

# Restart to apply the change
sudo systemctl restart ollama

Now I can use the API from any machine on my network, including from Ansible playbooks and scripts.
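With the service listening on the LAN, any lab machine can query it over HTTP. A minimal sketch of a request to Ollama's `/api/generate` endpoint (the model and prompt are just examples; `"stream": false` makes it return a single JSON object instead of a token stream):

```shell
# Request body for Ollama's /api/generate endpoint
# (model and prompt are placeholders; adjust to taste)
body='{"model": "llama3", "prompt": "Explain this NTP config", "stream": false}'
echo "$body"

# Send it from any machine on the lab network
# (replace the placeholder with your VM's IP):
# curl -s http://<vm-ip>:11434/api/generate -d "$body"
```

The same call drops straight into an Ansible `uri` task or any script that can POST JSON.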

Practical uses I’ve found

Summarizing logs: paste a 500-line vCenter log → ask “what’s wrong here?” → works surprisingly well.

Writing Ansible tasks: “Write an Ansible task to configure NTP on RHEL 8” → usually correct on first try.

Explaining configs: paste an NSX firewall ruleset → ask “explain what this allows” → great for audits.
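For the log-summarization case, it helps to pre-filter before pasting, since an 8B model has a limited context window. A self-contained sketch with a fake three-line log (file name and patterns are illustrative):

```shell
# Fake vCenter log for illustration
printf '%s\n' \
  '2024-05-01 INFO  vpxd: heartbeat ok' \
  '2024-05-01 ERROR vpxd: connection to host-42 lost' \
  '2024-05-01 WARN  vpxd: retrying' > vcenter.log

# Keep only the likely-interesting lines and cap the size
# before handing the text to the model
filtered=$(grep -iE 'error|warn|fail' vcenter.log | tail -n 100)
echo "$filtered"
```

The filtered text can then be piped into `ollama run llama3 "what's wrong here?"` or sent to the API.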

Model recommendations

| Model         | Size   | VRAM needed | Good for                           |
| ------------- | ------ | ----------- | ---------------------------------- |
| llama3:8b     | 4.7 GB | 6 GB        | General tasks, fast                |
| llama3:70b    | 40 GB  | 48 GB+      | Complex reasoning (needs big GPU)  |
| mistral:7b    | 4.1 GB | 6 GB        | Code generation                    |
| codellama:13b | 7.4 GB | 10 GB       | Code only, better than base llama  |
| phi3:mini     | 2.2 GB | 3 GB        | Lightweight, runs on CPU           |

For a 12GB GPU, llama3:8b or codellama:13b are the sweet spots.
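The recommendation boils down to a rule of thumb: pick the largest model whose VRAM figure fits your card. A sketch based on the table above (the thresholds mirror the VRAM column; they're rough working numbers, not official requirements):

```shell
# Pick a model from the table above based on available VRAM (in GB).
# Thresholds are rough, taken from the VRAM column.
pick_model() {
  vram=$1
  if   [ "$vram" -ge 48 ]; then echo "llama3:70b"
  elif [ "$vram" -ge 10 ]; then echo "codellama:13b"
  elif [ "$vram" -ge 6  ]; then echo "llama3:8b"
  else                          echo "phi3:mini"
  fi
}

pick_model 12   # a 12GB card lands on codellama:13b
```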


The whole setup took me about 2 hours. GPU passthrough on ESXi is the fiddly part — everything after that is straightforward.