Llama: Inference code for language models (deprecated)
PyTorch inference for Meta's Llama language models.
Llama is an inference framework for Meta's open-source language models, ranging from 7 billion to 70 billion parameters. It uses PyTorch and supports distributed inference through torchrun, allowing parallel execution across multiple GPUs. The codebase includes model loading utilities, tokenizer integration, and example scripts for chat completion tasks. The repository has been deprecated in favor of specialized downstream projects that handle model distribution, safety, tooling, and agentic systems separately.
Distributed inference support
Uses torchrun for multi-GPU inference with configurable model parallelism, allowing users to adjust nproc_per_node based on model size requirements.
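As a sketch of the distributed launch: the script and flag names below follow the repository's example scripts, and the model-parallel values are the ones the README pairs with each checkpoint size.

```shell
# Single-GPU inference with the 7B checkpoint (model parallelism = 1).
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4

# Larger checkpoints need proportionally more processes:
# 13B uses --nproc_per_node 2, 70B uses --nproc_per_node 8.
```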
Minimal reference implementation
Designed as a lightweight example rather than a comprehensive framework, with basic utilities for model loading and tokenization that can be extended or integrated into other systems.
Direct model access
Provides download scripts and integration with Hugging Face for accessing model weights and tokenizers after license approval, with support for multiple model variants.
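A sketch of the typical weight-access flow. The `download.sh` script is the repository's own; the Hugging Face commands and the `meta-llama/Llama-2-7b` repo id are assumptions about the hosted mirrors, which are gated behind the same license approval.

```shell
# After accepting the license, run the repository's download script and
# paste the signed URL from the approval email when prompted.
./download.sh

# Alternative (assumed): pull converted weights from Hugging Face with an
# account that has been granted access to the gated meta-llama repos.
huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b
```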
from llama import Llama

# Build the generator from a downloaded checkpoint directory and tokenizer.
# max_seq_len and max_batch_size bound memory use at load time.
generator = Llama.build(
    ckpt_dir="llama-2-7b/",
    tokenizer_path="tokenizer.model",
    max_seq_len=128,
    max_batch_size=4,
)

# Complete a prompt; temperature controls sampling randomness.
prompts = ["The future of AI is"]
results = generator.text_completion(prompts, max_gen_len=64, temperature=0.6)
print(results[0]["generation"])
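For chat models, `Llama.chat_completion` takes dialogs rather than raw prompts: each dialog is a list of `{"role", "content"}` messages, optionally starting with a `"system"` message, then alternating `"user"`/`"assistant"` turns and ending on a user turn. The validator below is a hypothetical helper to illustrate that ordering; it is not part of the repository.

```python
def is_valid_dialog(dialog):
    """Check the role ordering chat_completion expects (illustrative only)."""
    roles = [msg["role"] for msg in dialog]
    if roles and roles[0] == "system":      # optional leading system prompt
        roles = roles[1:]
    if not roles or roles[-1] != "user":    # generation continues a user turn
        return False
    expected = ["user", "assistant"] * (len(roles) // 2 + 1)
    return roles == expected[:len(roles)]

dialog = [
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(is_valid_dialog(dialog))  # True
```

A well-formed dialog like the one above can be passed in a list to `generator.chat_completion(...)` in place of the raw prompt strings used for text completion.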
Related Repositories
Discover similar tools and frameworks used by developers
Model Context Protocol Servers
Reference implementations for LLM tool and data integration.
NAFNet
Efficient PyTorch architecture for image restoration tasks.
Continue
Multi-LLM coding agent with interactive and automated modes.
Ray
Unified framework for scaling AI and Python applications from laptops to clusters with distributed runtime.
Pica
Unified API platform connecting AI agents to 150+ integrations with auth and tool building.