Ollama: Run large language models locally
Go-based CLI for local LLM inference and management.
Learn more about ollama
Ollama is a Go-based command-line application that enables local execution and management of large language models on consumer-grade hardware. The system pulls pre-trained models packaged in the quantized GGUF format for memory efficiency (and can convert imported Safetensors weights to GGUF) and runs a local HTTP inference server that handles model loading and request processing. Users can customize model behavior through declarative configuration files called Modelfiles, which specify settings such as sampling temperature, system prompts, and the base model or adapters to build from, without requiring code changes. The architecture supports integration with external applications through its REST API, allowing Python and JavaScript clients to communicate with locally hosted models. This design prioritizes privacy and offline capability by eliminating dependencies on cloud-based inference services, while accepting the trade-off of reduced throughput compared to distributed GPU clusters.
GGUF Format Support
Natively imports quantized GGUF and Safetensors models for efficient inference on consumer hardware. Quantization reduces memory requirements by roughly 4-8x compared to full-precision models with minimal loss in output quality.
Modelfile Customization
Define parameters, system prompts, and configurations in declarative Modelfiles to create custom model variants. Build and version modified models locally without altering base weights.
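As a rough sketch (the base model, variant name, and parameter value are illustrative), a Modelfile can be written out and built into a local variant with `ollama create`; the same `FROM` directive also accepts a path to a local GGUF file for imports:

```python
import os
import subprocess
import tempfile

# A minimal Modelfile: start from a base model (FROM also accepts a path to a
# local GGUF file), adjust a sampling parameter, and set a system prompt.
modelfile = """\
FROM llama2
PARAMETER temperature 0.3
SYSTEM You are a concise technical assistant.
"""

# Write the Modelfile to disk, then build a local variant named "llama2-concise"
# without modifying the base weights.
with tempfile.NamedTemporaryFile("w", suffix=".Modelfile", delete=False) as f:
    f.write(modelfile)
    path = f.name

subprocess.run(["ollama", "create", "llama2-concise", "-f", path], check=True)
os.remove(path)
```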
Multi-Platform Distribution
Native installers for macOS and Windows, shell scripts for Linux, and official Docker images ensure consistent deployment. Run identical models across development laptops, servers, and containerized environments.
```python
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama2',
    'prompt': 'Why is the sky blue?',
    'stream': False
})
print(response.json()['response'])
```

Adds logprobs support to both the native Ollama API and the OpenAI-compatible API (a usage sketch follows the notes below), fixes tool-calling bugs, and enables opt-in Vulkan acceleration.
- Set OLLAMA_VULKAN=1 to enable Vulkan acceleration; Ollama now prefers dedicated GPUs over integrated GPUs when scheduling models.
- Tool definitions now correctly omit the 'required' field when it is unspecified, and the missing 'tool_call_id' field in OpenAI-compatible API responses has been fixed.
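A hedged sketch of the logprobs addition through the OpenAI-compatible endpoint, using the `openai` Python client; the model name and the assumption that the standard `logprobs`/`top_logprobs` parameters are honored are illustrative rather than confirmed by these notes:

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API under /v1; the api_key is required by
# the client library but ignored by the local server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.2",  # assumes this model has already been pulled locally
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    logprobs=True,     # request per-token log probabilities
    top_logprobs=3,    # also return the top alternatives for each token
)

print(resp.choices[0].message.content)
print(resp.choices[0].logprobs)
```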
Adds embedding model support to `ollama run` and fixes critical hangs from CPU discovery; no breaking changes reported.
- Run embedding models directly with `ollama run embeddinggemma "text"` or pipe input via stdin to generate vectors (see the API sketch after these notes).
- Update if you hit CPU-discovery hangs or need tool-call IDs from `/api/chat`; this release also fixes qwen3-vl:235b errors and stale VRAM readings.
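For the embedding workflow, a minimal sketch against the local REST API; the endpoint and field names follow the documented `/api/embeddings` route, and the model name mirrors the CLI example above:

```python
import requests

# Generate an embedding vector from a locally hosted embedding model.
response = requests.post("http://localhost:11434/api/embeddings", json={
    "model": "embeddinggemma",
    "prompt": "The sky is blue because of Rayleigh scattering.",
})

vector = response.json()["embedding"]
print(len(vector), vector[:8])
```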
Fixes a performance regression affecting CPU-only systems introduced in v0.12.8.
- Upgrade if running on CPU-only hardware to restore prior inference performance.
- Release notes do not specify the root cause or affected workloads beyond CPU systems.
Related Repositories
Discover similar tools and frameworks used by developers
fastmcp
Build Model Context Protocol servers with decorators.
NAFNet
Efficient PyTorch architecture for image restoration tasks.
ByteTrack
Multi-object tracker associating low-confidence detections across frames.
CLIP
Multimodal zero-shot classifier using contrastive vision-language learning.
cai
LLM-powered Python framework for automated penetration testing workflows.