Ollama: Run large language models locally
Go-based CLI for local LLM inference and management.
Ollama is a Go-based command-line application that enables local execution and management of large language models on consumer-grade hardware. The system downloads pre-trained models in the quantized, memory-efficient GGUF format and runs a local HTTP inference server that handles model loading and request processing. Users can customize model behavior through declarative configuration files called Modelfiles, which specify a base model plus parameters such as sampling temperature and system prompts, without modifying the underlying weights or writing code. The architecture supports integration with external applications through a REST API, allowing Python and JavaScript clients to communicate with locally hosted models. This design prioritizes privacy and offline capability by eliminating dependencies on cloud-based inference services, while accepting the trade-off of reduced throughput compared to distributed GPU clusters.
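The REST API mentioned above can be exercised from Python with nothing but the standard library. A minimal sketch, assuming a local server on the default port 11434; the model name `llama2` is illustrative:

```python
# Build a request for Ollama's /api/chat endpoint using only the stdlib.
import json
import urllib.request

def build_chat_request(content, model="llama2", host="http://localhost:11434"):
    """Construct an HTTP POST request for the /api/chat endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": False,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Why is the sky blue?")
# With a running Ollama server, urllib.request.urlopen(req) returns a JSON
# body whose 'message' field holds the assistant's reply.
print(req.full_url)
```

Because the request is plain HTTP with a JSON body, the same payload shape works from any language or HTTP client.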
GGUF Format Support
Natively imports quantized GGUF and Safetensors models for efficient inference on consumer hardware. Reduces memory requirements by 4-8x compared to full-precision models while maintaining performance.
Modelfile Customization
Define parameters, system prompts, and configurations in declarative Modelfiles to create custom model variants. Build and version modified models locally without altering base weights.
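A minimal Modelfile illustrating the idea; the base model and system prompt are illustrative:

```
FROM llama2
PARAMETER temperature 0.7
SYSTEM "You are a concise technical assistant."
```

Building and running the variant registers it locally without touching the base weights:

```
ollama create my-assistant -f Modelfile
ollama run my-assistant
```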
Multi-Platform Distribution
Native installers for macOS and Windows, shell scripts for Linux, and official Docker images ensure consistent deployment. Run identical models across development laptops, servers, and containerized environments.
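The Linux and Docker paths reduce to one command each; the named volume in the Docker invocation persists downloaded models across container restarts:

```shell
# Linux: official install script
curl -fsSL https://ollama.com/install.sh | sh

# Docker: official image, exposing the default API port and
# persisting models in a named volume
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

Either way, the server listens on port 11434, so client code is identical across platforms.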
import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama2',
    'prompt': 'Why is the sky blue?',
    'stream': False
})
print(response.json()['response'])

Bug fixes for the ollama launch command, including context limits, missing model downloads, and image handling.
- Fixed context limits when running ollama launch droid
- ollama launch will now download missing models instead of erroring
- Fixed a bug where ollama launch claude would cause context compaction when providing images
New models Qwen3-Coder-Next and GLM-OCR, plus enhanced ollama launch with argument passing and subagent support.
- Qwen3-Coder-Next: a coding-focused language model from Alibaba's Qwen team, optimized for agentic coding workflows and local development
- GLM-OCR: a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture
- ollama launch can now be passed arguments, for example ollama launch claude -- --resume
- ollama launch will now run subagents when using ollama launch claude
- Ollama will now set context limits for a set of models when using ollama launch opencode
Improved OpenClaw integration with automatic onboarding flow when launching for the first time.
- ollama launch openclaw will now enter the standard OpenClaw onboarding flow if it has not yet been completed
Related Repositories
Discover similar tools and frameworks used by developers
NAFNet
Efficient PyTorch architecture for image restoration tasks.
Prompt Engineering Guide
Guides, papers, and resources for prompt engineering, RAG, and AI agents.
Qwen
Alibaba Cloud's pretrained LLMs supporting Chinese/English with up to 32K context length.
Megatron-LM
Library for training large transformer models with distributed computing and GPU-optimized building blocks.
MediaPipe
Graph-based framework for streaming media ML pipelines.