Llama: Inference code for language models (deprecated)
PyTorch inference for Meta's Llama language models.
Llama provides minimal PyTorch inference code for Meta's Llama language models, which range from 7 billion to 70 billion parameters. It supports distributed inference through torchrun, enabling parallel execution across multiple GPUs. The codebase includes model loading utilities, tokenizer integration, and example scripts for text and chat completion. The repository has been deprecated in favor of specialized downstream projects that handle model distribution, safety, tooling, and agentic systems separately.
Distributed inference support
Uses torchrun for multi-GPU inference with configurable model parallelism, allowing users to adjust nproc_per_node based on model size requirements.
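A sketch of the launch pattern the repository documents: example scripts run under torchrun, with nproc_per_node matching the checkpoint's model-parallel degree (the README uses 1 process for 7B, 2 for 13B, and 8 for 70B). The paths below are illustrative.

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4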
Minimal reference implementation
Designed as a lightweight example rather than a comprehensive framework, with basic utilities for model loading and tokenization that can be extended or integrated into other systems.
Direct model access
Provides download scripts and integration with Hugging Face for accessing model weights and tokenizers after license approval, with support for multiple model variants.
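As an illustrative alternative to the bundled download scripts, weights can be fetched with the huggingface_hub client once the gated meta-llama listing has approved your license request; the repo id and token setup here are assumptions about your environment, not part of this codebase.

from huggingface_hub import snapshot_download

# Assumes license approval on the meta-llama model page and a
# logged-in token (huggingface-cli login).
local_dir = snapshot_download(repo_id="meta-llama/Llama-2-7b")
print(local_dir)  # directory containing the weights and tokenizer.model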
from llama import Llama

# Build the generator from a local checkpoint directory and tokenizer.
# Scripts that call Llama.build are launched via torchrun (see above),
# even on a single GPU.
generator = Llama.build(
    ckpt_dir="llama-2-7b/",
    tokenizer_path="tokenizer.model",
    max_seq_len=128,
    max_batch_size=4,
)

prompts = ["The future of AI is"]
results = generator.text_completion(prompts, max_gen_len=64, temperature=0.6)
print(results[0]['generation'])
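The repository's chat examples use a separate chat_completion API with the dialog format from example_chat_completion.py; a minimal sketch reusing the generator above (the chat-tuned llama-2-7b-chat checkpoints are the intended fit for this API):

# Dialogs are lists of {"role", "content"} messages.
dialogs = [[{"role": "user", "content": "Explain model parallelism briefly."}]]
chat_results = generator.chat_completion(dialogs, max_gen_len=64, temperature=0.6)
print(chat_results[0]['generation']['content'])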