Luigi: Python batch job pipeline orchestration
Build complex batch pipelines with dependency management.
Luigi is a Python framework for building and orchestrating complex batch job pipelines, with built-in dependency management and scheduling. Tasks are defined as Python classes that declare their dependencies, outputs, and execution logic; the framework resolves these declarations into a directed acyclic graph to determine the correct execution order. A central scheduler coordinates task execution across workers, handling parallelization and failure recovery and skipping tasks whose outputs already exist. A web-based visualization interface displays the dependency graph and shows pipeline execution status in real time. Luigi is designed for long-running batch processes and is commonly used with data processing systems such as Hadoop, in both local and distributed environments.
Declarative Dependency Resolution
Tasks define inputs and outputs as Python objects, allowing Luigi to automatically compute execution order and determine which tasks need to run. Eliminates manual workflow orchestration and prevents redundant task execution.
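The resolution idea can be sketched without Luigi itself. The following is a minimal, standard-library-only illustration, not Luigi's implementation: the task names, the DEPENDENCIES mapping, and the helpers output_path, plan, and run_all are all hypothetical. It orders tasks so that each runs after its dependencies and skips any task whose output file already exists.

```python
import os
import tempfile
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task name maps to the tasks it depends on.
DEPENDENCIES = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}


def output_path(workdir, task):
    """Location of a task's single output file (an assumption of this sketch)."""
    return os.path.join(workdir, f"{task}.txt")


def plan(workdir):
    """Return tasks in dependency order, skipping those whose output exists."""
    order = TopologicalSorter(DEPENDENCIES).static_order()
    return [t for t in order if not os.path.exists(output_path(workdir, t))]


def run_all(workdir):
    """Run every pending task by writing its output file; return what ran."""
    pending = plan(workdir)
    for task in pending:
        with open(output_path(workdir, task), "w") as f:
            f.write(f"{task} done\n")
    return pending


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(run_all(workdir))  # first run: all three tasks, dependencies first
        print(run_all(workdir))  # second run: every output exists, nothing to do
```

Using output existence as the completion check is what makes a rerun cheap: deleting one output file makes only that task pending again.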
Atomic File Operations
File system abstractions for local and HDFS storage ensure atomic writes that complete fully or not at all. Prevents pipelines from entering corrupted states when failures occur mid-task, eliminating manual cleanup.
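A common way to get this all-or-nothing behavior on a local file system is to write to a temporary file and rename it into place only after the write succeeds. The sketch below shows that general pattern using the standard library only. The atomic_write helper is hypothetical, not a Luigi API, and Luigi's own targets may differ in detail.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(path):
    """Yield a writable file; the data appears at `path` only on success."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            yield f
        os.replace(tmp_path, path)  # atomic rename on POSIX file systems
    except BaseException:
        # Failure mid-write: discard the partial file and leave `path` untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "output.txt")
        try:
            with atomic_write(target) as f:
                f.write("partial")
                raise RuntimeError("simulated failure")
        except RuntimeError:
            pass
        print(os.path.exists(target))  # False: no partial output was left behind
```

Because the target path either holds a complete file or does not exist, an existence check on the output remains a trustworthy completion signal after a crash.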
Integrated Hadoop Ecosystem
Built-in templates for MapReduce, Hive, Pig, and Spark jobs with native HDFS support. Run Hadoop workflows without external orchestration layers or custom integration code.
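import luigi


class ProcessData(luigi.Task):
    def output(self):
        # The task counts as complete once this target exists
        return luigi.LocalTarget('output.txt')

    def run(self):
        # Write the result to the target declared in output()
        with self.output().open('w') as f:
            f.write('Processing complete')


if __name__ == '__main__':
    # Run with the in-process scheduler; no central scheduler daemon is needed
    luigi.build([ProcessData()], local_scheduler=True)

Drops Python 3.5 and 3.6 support; fixes multiple security issues including sensitive logging, file permissions, and tarfile extraction vulnerabilities.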
- Upgrade to Python 3.7+ before deploying; Python 3.5 and 3.6 are no longer supported.
- Review pai.py, lock.py, lsf.py, and runner modules for patched security flaws affecting credentials and file handling.
Maintenance release updating Azure Blob Storage dependency to 12.x series and fixing batch email configuration documentation.
- Upgrade azure.storage.blob to 12.x.y if using luigi.contrib.azureblob; verify compatibility with your Azure storage code.
- Review batch email configuration docs for corrections; release notes do not specify other breaking changes or security fixes.
Maintenance release adding Python 3.12 support and fixing parameter handling, error messages, and SVG visualization bugs.
- Upgrade to Python 3.12 if needed; this release officially supports it alongside existing versions.
- Review TupleParameter usage; str-to-tuple conversion bug is fixed, and optional parameter execution summaries now display correctly.
Related Repositories
Discover similar tools and frameworks used by developers
airflow
Python platform for DAG-based task orchestration and scheduling.
n8n
Node-based automation platform with JavaScript and Python scripting.
docling
Fast document parser for RAG and AI workflows.
patroni
Automates PostgreSQL failover using distributed consensus systems.
supabase
PostgreSQL backend with auto-generated APIs and real-time subscriptions.