Llama: Inference code for language models
PyTorch inference for Meta's Llama language models.
Llama provides PyTorch inference code for Meta's openly released Llama language models, which range from 7 billion to 70 billion parameters. It supports distributed inference through torchrun, allowing execution to be parallelized across multiple GPUs. The codebase includes model loading utilities, tokenizer integration, and example scripts for text and chat completion. The repository has since been deprecated in favor of specialized downstream projects that handle model distribution, safety, tooling, and agentic systems separately.
Distributed inference support
Uses torchrun for multi-GPU inference with configurable model parallelism, allowing users to adjust nproc_per_node based on model size requirements.
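For reference, a single-GPU launch of the repository's text completion example typically looks like the sketch below; the flag values are illustrative, and nproc_per_node should be raised to match the model-parallel degree required by larger checkpoints.

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4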
Minimal reference implementation
Designed as a lightweight example rather than a comprehensive framework, with basic utilities for model loading and tokenization that can be extended or integrated into other systems.
Direct model access
Provides download scripts and integration with Hugging Face for accessing model weights and tokenizers after license approval, with support for multiple model variants.
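Once access has been granted, the weights can also typically be loaded through the Hugging Face transformers library instead of the repository's own loader; a minimal sketch, where the Hub identifier shown is illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative Hub identifier; gated models require approved access and a login token.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))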
from llama import Llama

# Build the generator from a local checkpoint directory and tokenizer file.
# Launch the script with torchrun so the distributed backend is initialized.
generator = Llama.build(
    ckpt_dir="llama-2-7b/",
    tokenizer_path="tokenizer.model",
    max_seq_len=128,
    max_batch_size=4,
)

# Generate completions for a batch of prompts.
prompts = ["The future of AI is"]
results = generator.text_completion(prompts, max_gen_len=64, temperature=0.6)
print(results[0]['generation'])
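The generator built above also exposes a chat-style interface; a minimal sketch, assuming a chat-tuned checkpoint (e.g., llama-2-7b-chat/) and a dialog of role-tagged messages:

dialogs = [[
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "What is model parallelism?"},
]]
chat_results = generator.chat_completion(dialogs, max_gen_len=64, temperature=0.6)
print(chat_results[0]['generation']['content'])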
Related Repositories
Discover similar tools and frameworks used by developers
ComfyUI-Manager
Graphical package manager for ComfyUI custom nodes.
LLaMA-Factory
Parameter-efficient fine-tuning framework for 100+ LLMs.
tesseract
LSTM-based OCR engine supporting 100+ languages.
mlx
Lazy-evaluated NumPy-like arrays optimized for Apple silicon.
DeepSpeed
PyTorch library for training billion-parameter models efficiently.