
Ollama

Run LLMs locally - Llama, Mistral, and more on your machine

TL;DR

One-liner: Ollama lets you run open-source LLMs locally with a single command - no cloud, no API keys, complete privacy.

Core Value:

  • Privacy - your data never leaves your machine
  • No API costs - run models unlimited times after download
  • Offline capable - works without internet
  • Simple - one command to pull and run any model

Quick Start

Install

macOS: Download Ollama for Mac from https://ollama.com/download

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download Ollama for Windows from https://ollama.com/download

Verify

ollama --version

Run Your First Model

ollama run llama3.2

This downloads and starts Llama 3.2. Type a prompt and press Enter.

Chat Example

>>> What is the capital of France?
The capital of France is Paris.

>>> /bye
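
You can also pass the prompt directly as an argument to ollama run for a one-shot answer instead of an interactive session:

ollama run llama3.2 "What is the capital of France?"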

Cheatsheet

Command                            Description
ollama run MODEL                   Run a model (downloads if needed)
ollama pull MODEL                  Download a model
ollama list                        List downloaded models
ollama ps                          Show running models
ollama stop MODEL                  Stop a running model
ollama show MODEL                  Show model details
ollama rm MODEL                    Delete a model
ollama create NAME -f Modelfile    Create a custom model
ollama serve                       Start the Ollama server

Model            Size    Command
Llama 3.2        2GB     ollama run llama3.2
Llama 3.3 70B    43GB    ollama run llama3.3:70b
Mistral          4GB     ollama run mistral
Gemma 3          3GB     ollama run gemma3
DeepSeek-R1      4GB     ollama run deepseek-r1
Phi 4            9GB     ollama run phi4
Qwen 2.5         4GB     ollama run qwen2.5

API Usage

Ollama exposes a REST API on localhost:11434, plus an OpenAI-compatible endpoint under /v1 for use with existing OpenAI client libraries.

Generate Text

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
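
To extract just the generated text from the JSON response, the output can be piped through jq (assuming jq is installed; the answer is in the response field):

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq -r '.response'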

Chat Completion

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'
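
The same request also works against the OpenAI-compatible endpoint under /v1, which is what lets existing OpenAI client libraries talk to Ollama by only swapping the base URL. A minimal curl sketch:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'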

Use with Python

import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'Explain quantum computing briefly',
    'stream': False
})
print(response.json()['response'])
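
Streaming also works: with "stream": true the API returns one JSON object per line until a final object with "done": true. A minimal sketch using the same requests library (illustrative, not the only way to consume the stream):

import json
import requests

# stream=True on the HTTP call lets us read the response line by line
with requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'Explain quantum computing briefly',
    'stream': True,
}, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)                      # each line is one JSON object
        print(chunk.get('response', ''), end='', flush=True)
        if chunk.get('done'):                         # last chunk signals completion
            print()
            break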

Custom Models (Modelfile)

Create a Modelfile to customize behavior:

FROM llama3.2

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9

# System prompt
SYSTEM """You are a helpful coding assistant.
Always provide clear explanations with examples."""

Build and run:

ollama create myassistant -f Modelfile
ollama run myassistant
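
To double-check what the custom model was built from, ollama show can print the Modelfile back (using the myassistant name created above):

ollama show myassistant --modelfile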

Gotchas

Model download stuck or slow

Reason: Large files, network issues

Solution:

# Re-run the pull; progress is shown by default and interrupted downloads resume
ollama pull llama3.2

# Use a smaller model first
ollama run gemma3:1b

Out of memory (OOM)

Reason: Model too large for available RAM

Solution:

  • Use smaller model variants, e.g., llama3.2:1b instead of llama3.2 (see the example after this list)
  • Close other applications
  • Memory requirements: 8GB for 7B models, 16GB for 13B, 32GB for 33B+
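
For the first point, a quick way to compare memory footprints is to load a smaller tag and check its size while it is running (llama3.2:1b is an existing tag in the model library):

ollama run llama3.2:1b "hello"
ollama ps    # the SIZE column shows how much memory the loaded model occupies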

GPU not being used

Reason: Missing drivers or unsupported GPU

Solution:

# Check whether the loaded model is using the GPU
ollama ps    # the PROCESSOR column shows GPU vs CPU

# NVIDIA: Install CUDA drivers
# AMD: ROCm support on Linux only
# Apple Silicon: Metal used automatically

API connection refused

Reason: Ollama server not running

Solution:

# Start the server
ollama serve

# Or run a model (auto-starts server)
ollama run llama3.2
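
A quick way to confirm the server is reachable is to hit the root endpoint, which replies with a short status string:

curl http://localhost:11434
# Expected reply: "Ollama is running"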

Model gives wrong or hallucinated answers

Reason: LLMs generate plausible-sounding text and can state incorrect facts with confidence, especially smaller models

Solution:

  • Use larger models for complex tasks
  • Add context in your prompts
  • Verify important information independently

Next Steps